Unit 8: Linear Regression

QBA 201 – Summer 2013

Instructor: Michael Malcolm

8.1: Single-variable OLS regression

8.2: Inference in single-variable OLS regression

8.3: Multiple-variable OLS regression

8.4: Inference in multiple-variable OLS regression

8.1: Single-Variable OLS

In a single-variable regression model, we are interested in the relationship between a dependent variable and an independent variable. For example, we might be interested in how an individual’s wage (the dependent variable) depends on his level of education (the independent variable). We typically denote the dependent variable as 𝑦 and the independent variable as 𝑥.

In a linear regression model, we propose that the true population relationship between 𝑦 and 𝑥 is a linear function:

𝑦𝑖 = 𝛽0+ 𝛽1𝑥𝑖 + 𝑢𝑖

This function is known as the population regression line. As for the notation:

• 𝑖 = 1, … , 𝑛 denotes a member of the population.

• 𝑦_𝑖 is the dependent variable. This can also be called the regressand or the left-hand variable. For the example above, this is the person’s wage.

• 𝑥_𝑖 is the independent variable. This can also be called the regressor or the right-hand variable. For the example above, this is the person’s education level.

• 𝛽₀ is the intercept. This is the expected value of 𝑦_𝑖 when 𝑥_𝑖 is equal to zero. For the example above, this is the expected wage for a person with zero years of education.

• 𝛽₁ is the slope. This is the change in 𝑦 that corresponds to each one-unit increase in 𝑥. For the example above, this is the increase in wage associated with each additional year of education.

• 𝑢_𝑖 is the error. It captures all of the other things that determine 𝑦_𝑖 other than 𝑥_𝑖. For the example above, this might include age, experience, location or even just random noise.

The important thing to note here is that 𝛽₀ and 𝛽₁ are the parameters of the true population regression line. These are unknown in practice. We are trying to estimate them. This is the same concept as estimating the true population mean – while the true mean 𝜇 is unknown, we estimate it by using the sample mean 𝑥̅.

The basic idea behind estimating a linear regression is to choose a line that “fits well” through the given data. We seek to develop an estimated regression line as such:

𝑦̂_𝑖 = 𝛽̂₀+ 𝛽̂₁𝑥_𝑖

Now, unless the data fit exactly on a straight line, then our estimated regression line will not be able to get the value of 𝑦 exactly right for every observation. The residual is the difference between the true value of 𝑦𝑖 and the predicted value 𝑦̂𝑖 on the line:

𝑢̂_𝑖 = 𝑦_𝑖 − 𝑦̂_𝑖

Graphically, the residual is the vertical distance between the true value of 𝑦 and the value given by the estimated regression line. Note that this could be positive or negative.

It is very important to be clear about the difference between the residual 𝑢̂_𝑖 and the error 𝑢_𝑖. The error is the deviation of 𝑦_𝑖 from the true population regression line and is unknown, because the true line is unknown. The residual is the observed deviation between the actual value of 𝑦 and the value predicted by the estimated regression line.

Now, the idea in choosing the regression line is to choose a line that keeps the residuals as small as possible. We cannot simply choose a line that minimizes the sum of the residuals, because these could be positive or negative. Rather, we choose a line that minimizes the sum of the squared residuals over all observations. Precisely, the sum of squared residuals is:

$$SSR = \sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

The ordinary least squares (OLS) regression line is the estimated regression line that minimizes the sum of the squared residuals. That is, we choose the intercept of the line 𝛽̂₀ and we choose the slope of the line 𝛽̂₁ so that the sum of the squared residuals across all observations will be as low as possible.

How do we choose 𝛽̂₀ and 𝛽̂₁ for the OLS regression line? It turns out that there are very simple formulas for the intercept and the slope of the “best-fitting” line through a given set of data, where we are defining “best-fitting” as the OLS regression line that minimizes the sum of the squared residuals across observations. For the OLS regression line:

$$\hat{\beta}_1 = \frac{s_{xy}}{s_x^2} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

The slope term for the OLS regression line is just the ratio of the sample covariance of 𝑥 and 𝑦 to the sample variance of 𝑥. Once we calculate these coefficients, the idea is that the line 𝑦̂_𝑖 = 𝛽̂₀ + 𝛽̂₁𝑥_𝑖 is the “best-fitting” linear approximation of the relationship between 𝑥 and 𝑦.
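
The two formulas can be tried out on a small data set. The following is a minimal sketch in plain Python; the numbers in `x` and `y` are made up for illustration (they are not the data behind the notes), and the function name `ols_fit` is a hypothetical helper.

```python
# Hypothetical sample: years of education (x) and hourly wage (y).
x = [8, 10, 12, 12, 14, 16, 18]
y = [5.0, 6.5, 8.0, 9.0, 9.5, 12.0, 13.0]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def ols_fit(x, y):
    """Return (intercept, slope) of the OLS line through the points (x, y)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    # Sample covariance of x and y, and sample variance of x (both divide by n - 1).
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_xx = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    slope = s_xy / s_xx                 # beta1_hat = s_xy / s_x^2
    intercept = y_bar - slope * x_bar   # beta0_hat = y_bar - beta1_hat * x_bar
    return intercept, slope


b0, b1 = ols_fit(x, y)
print(f"intercept = {b0:.4f}, slope = {b1:.4f}")
```

Any other line drawn through these points would have a larger sum of squared residuals than the one returned by `ols_fit`.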

There are a few numerical properties of an OLS regression line that are worth noting. Again, recall that the predicted value on the line is 𝑦̂_𝑖 = 𝛽̂₀+ 𝛽̂₁𝑥_𝑖, while the residual is the difference between the actual and the predicted value: 𝑢̂_𝑖 = 𝑦_𝑖 − 𝑦̂_𝑖. When the coefficients are calculated using the formulas above, the following properties always hold:

1. $\sum_{i=1}^{n} \hat{u}_i = 0$

2. $\sum_{i=1}^{n} \hat{u}_i x_i = 0$

3. (𝑥̅, 𝑦̅) is on the regression line.

Seeing property (3) is easy. Recall that the OLS intercept is:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

If we rearrange this equation slightly, we obtain:

𝑦̅ = 𝛽̂₀+ 𝛽̂₁𝑥̅

This shows that the point (𝑥̅, 𝑦̅) falls exactly on the regression line 𝑦̂_𝑖 = 𝛽̂₀+ 𝛽̂₁𝑥_𝑖.
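
The three numerical properties can also be checked directly. The sketch below uses made-up data (an assumption for illustration only) and confirms each property up to floating-point rounding.

```python
# Hypothetical sample: years of education (x) and hourly wage (y).
x = [8, 10, 12, 12, 14, 16, 18]
y = [5.0, 6.5, 8.0, 9.0, 9.5, 12.0, 13.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS slope and intercept; the (n - 1) divisors of the sample covariance
# and sample variance cancel in the ratio, so they are omitted here.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

sum_resid = sum(residuals)                                 # property 1
sum_resid_x = sum(u * xi for u, xi in zip(residuals, x))   # property 2
line_at_mean = b0 + b1 * x_bar                             # property 3

print(f"sum of residuals        = {sum_resid:.2e}")
print(f"sum of residuals times x = {sum_resid_x:.2e}")
print(f"line at x_bar = {line_at_mean:.4f}, y_bar = {y_bar:.4f}")
```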

An important question is how well the regression line fits through the data. The most common measure of “goodness of fit” is 𝑅². Specifically, 𝑅² is the percentage of the variation in 𝑦 that is explained by 𝑥.

Using our example, where 𝑦 is wage and 𝑥 is education level, we know that there is variation in wages across individuals. 𝑅² is the percentage of variation in wages that is explained by differences in education levels. Geometrically, 𝑅² roughly shows how well a linear function fits through the data.

𝑅² ranges between 0 and 1. 𝑅² = 1 when the data fit perfectly on a straight line, and 𝑅² drops as the linear relationship between 𝑥 and 𝑦 becomes weaker.

To calculate 𝑅² we need two more definitions. The total sum of squares (TSS) is the total variation in 𝑦 about its mean value:

$$TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

The explained sum of squares (ESS) is the variation in 𝑦 about 𝑦̅ that is explained by the regression line:

$$ESS = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

𝑅² is the proportion of the total variation in 𝑦 that is explained by the regression line, so:

$$R^2 = \frac{ESS}{TSS}$$

To expand this idea even further, recall the definition of the sum of squared residuals:

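$$SSR = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

For an OLS regression line, the total sum of squares splits exactly into the explained part and the residual part:

$$TSS = ESS + SSR$$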

The way to interpret this is that 𝑇𝑆𝑆 represents all of the variation in 𝑦. 𝐸𝑆𝑆 is the part of this variation that is explained by the regression line, while 𝑆𝑆𝑅 is the part of this variation in 𝑦 that is not explained by the regression line. Indeed, 𝑆𝑆𝑅 measures the deviations in 𝑦_𝑖 about the predicted value from the regression line 𝑦̂_𝑖.

Noting from the identity above that 𝐸𝑆𝑆 = 𝑇𝑆𝑆 − 𝑆𝑆𝑅, we can rewrite the definition of 𝑅²:

$$R^2 = \frac{TSS - SSR}{TSS} = 1 - \frac{SSR}{TSS}$$

Finally, an interesting numerical property is that, in a single-variable regression, 𝑅² is just the square of the sample correlation between 𝑥 and 𝑦:

$$R^2 = r_{xy}^2$$
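
As a check on these identities, the sketch below computes 𝑅² three ways (from ESS/TSS, from 1 − SSR/TSS, and as the squared sample correlation) on made-up data. The data and variable names are assumptions for illustration only.

```python
import math

# Hypothetical sample: years of education (x) and hourly wage (y).
x = [8, 10, 12, 12, 14, 16, 18]
y = [5.0, 6.5, 8.0, 9.0, 9.5, 12.0, 13.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products about the means.
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

tss = syy                                               # total variation in y
ess = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation
ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained variation

r2_from_ess = ess / tss
r2_from_ssr = 1 - ssr / tss
r_xy = sxy / math.sqrt(sxx * syy)                       # sample correlation

print(f"TSS = {tss:.4f}, ESS + SSR = {ess + ssr:.4f}")
print(f"R^2 = {r2_from_ess:.4f} = {r2_from_ssr:.4f} = {r_xy ** 2:.4f}")
```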

Microsoft Excel can be used for implementing OLS regression estimates. After clicking “data analysis” under the “data” tab, choose “regression”.

For now, the important thing to note is that the coefficients of the OLS regression equation are given in the last two rows. The intercept is 𝛽̂₀ = −1.6047 and the slope term is 𝛽̂₁ = 0.8139. Combining, the estimated regression line is:

𝑦̂_𝑖 = −1.6047 + 0.8139𝑥_𝑖

This is the best-fitting linear function to approximate the relationship between education level 𝑥𝑖 and wage 𝑦_𝑖. Practically, what this means is that each one-year increase in education 𝑥_𝑖 is associated with an increase of $0.8139 in wage 𝑦_𝑖.

One can use this regression line to predict wage for individuals with a particular education level. For example, for an individual with 𝑥_𝑖 = 12 years of education, our best prediction of his wage is:

𝑦̂𝑖 = −1.6047 + 0.8139 ⋅ 12 = 8.1621

But if the individual obtains one more year of education, then our best prediction of his wage rises by 0.8139 to:

𝑦̂_𝑖 = −1.6047 + 0.8139 ⋅ 13 = 8.9760

In using a regression line for prediction purposes, one important note is that you can only make reasonable predictions for values of 𝑥 that are not too far away from those in the sample. For example, the highest level of education in our sample is 𝑥 = 18, so the model might not be reliable for generating a predicted wage for a PhD graduate with 𝑥 = 23 years of education.
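
The prediction step, together with the warning about extrapolation, can be sketched as follows. The coefficients and the maximum education level of 18 come from the text; the function names are hypothetical, and only the upper end of the sample range is checked because the notes do not state the lowest education level in the sample.

```python
BETA0_HAT = -1.6047      # estimated intercept from the text
BETA1_HAT = 0.8139       # estimated slope from the text
MAX_EDUC_IN_SAMPLE = 18  # highest education level in the sample, per the text


def predict_wage(educ):
    """Predicted wage for a given number of years of education."""
    return BETA0_HAT + BETA1_HAT * educ


def is_extrapolation(educ):
    """True when educ lies above the largest value observed in the sample."""
    return educ > MAX_EDUC_IN_SAMPLE


for years in (12, 13, 23):
    note = " (outside sample range, treat with caution)" if is_extrapolation(years) else ""
    print(f"educ = {years}: predicted wage = {predict_wage(years):.4f}{note}")
```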

EXERCISES

1. A researcher uses a sample of 100 classrooms with data on class size (CS) and on average test scores (T) to estimate the following regression: 𝑇̂ = 520.4 − 5.82 ⋅ 𝐶𝑆

a. A classroom has 22 students. What is the prediction for the average test score?

b. Last year a classroom had 19 students and this year it has 23 students. What is your prediction for the change in the classroom average test score?

c. The sample average of class size across 100 classrooms is 21.4. What is the sample average of the test score across the 100 classrooms?

2. Education researchers often wonder about the factors associated with student evaluations of professors. One question is whether women are “stereotyped” by their physical attractiveness. A researcher conducts a study of 463 classes taught by female professors. He has a panel of judges rate each professor’s beauty on a scale of 1-5. He then regresses the professor’s average student evaluation score (also on a scale of 1-5) on her beauty score. The results are below.

a. Write out the regression equation.

b. Interpret the slope and the intercept of the regression in words.

8.2: Inference in single-variable OLS regression

The goal of OLS regression is to develop estimates for the coefficients of the true population regression 𝑦 = 𝛽0+ 𝛽1𝑥 + 𝑢. That is, we use the OLS formulas to develop an estimated regression line 𝑦̂ = 𝛽̂₀+ 𝛽̂₁𝑥.

The story is similar to estimating a population mean. When we estimate a population mean 𝜇, we develop an estimate 𝑥̅. Of course, the estimate will not be exactly correct, and in fact the central limit theorem describes the distribution of 𝑥̅ about the true population mean 𝜇. This is what hypothesis tests and confidence intervals are based on.

In the case of a regression, the object of interest is almost always the slope term 𝛽₁. That is, we are interested in the association between the independent variable 𝑥 and the dependent variable 𝑦. The sample coefficient 𝛽̂₁ is used to estimate the true population coefficient 𝛽₁. In the same way as estimating a population mean, the sample coefficient 𝛽̂₁ will not be exactly correct, since it is an estimate based on a sample from the population regression with true coefficient 𝛽₁. In fact, the central limit theorem applies: in large samples, the distribution of 𝛽̂₁ is approximately normal with mean 𝛽₁ and with standard deviation 𝑠𝑒(𝛽̂₁). The calculation of 𝑠𝑒(𝛽̂₁) is complicated, but luckily it can be implemented automatically on the computer.

To test hypotheses involving the value of 𝛽₁, we use exactly the same setup that we used earlier. The null is that 𝛽₁ has some particular value 𝛽₁,₀. We can test this against the alternative that 𝛽₁ is greater than, less than or not equal to that value. The test statistic is usually called a t-statistic, and it is calculated in the usual way:

$$t = \frac{\hat{\beta}_1 - \beta_{1,0}}{se(\hat{\beta}_1)}$$

Using a test with size 𝛼, the rejection regions are calculated in the usual way. The t-distributions for finding the relevant rejection regions use 𝑛 − 2 degrees of freedom.

This is the hypothesis test for a coefficient value in general. However, by far the most conventional case is to test the null hypothesis that 𝛽₁ = 0 against the alternative hypothesis that 𝛽₁ ≠ 0. That is, the research objective is to determine whether there is an association between 𝑥 and 𝑦 and specifically whether this relationship is statistically significant. The normal test is:

𝐻0: 𝛽1 = 0

𝐻_𝑎: 𝛽₁ ≠ 0

With a hypothesized value of zero, the test statistic simplifies to:

$$t = \frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$$

We then determine whether this test statistic falls in the rejection region.

The Excel output for a regression estimation gives both the standard error 𝑠𝑒(𝛽̂₁) and the t-statistic 𝑡 = 𝛽̂₁ / 𝑠𝑒(𝛽̂₁) as calculated above. It also gives the p-value for the two-sided hypothesis test of the null hypothesis 𝐻₀: 𝛽₁ = 0 against the alternative 𝐻_𝑎: 𝛽₁ ≠ 0.

One note – Excel (and other statistics software) uses the t-distribution in evaluating hypothesis tests. For large sample sizes, it doesn’t really make a difference because the t-distribution and the z-distribution are very close to each other when the degrees of freedom are large. However, for small sample sizes, the same warning applies as with all tests that use the t-distribution. In this case, the test is only reliable when the true underlying process obeys a normal distribution. If it does not, then you can’t do any inference using standard techniques when the sample size is too small.

As an example, suppose that a company has 10 sales locations, and it is interested in testing the relationship between the number of sales that the company gets and the distance from the sales location to the center of town. In this case, the dependent variable 𝑦 measures the average number of sales each day and the independent variable 𝑥 measures distance from the town in miles.

The estimated regression line is:

$$\hat{y}_i = 77.3495 - 4.2543 x_i$$

Or, in words, each 1-mile increase in distance is associated with a decrease of 4.2543 in the expected number of daily sales.

The statistical question, now, is whether this relationship is statistically significant. The standard error for the slope coefficient is given in the output: 𝑠𝑒(𝛽̂₁) = 1.5717. The t-statistic for the test of 𝐻₀: 𝛽₁ = 0 against the alternative 𝐻_𝑎: 𝛽₁ ≠ 0 can then be calculated as such:

$$t = \frac{\hat{\beta}_1}{se(\hat{\beta}_1)} = \frac{-4.2543}{1.5717} = -2.71$$

Now, since the sample size is 𝑛 = 10, our hypothesis test should use the t-distribution with 𝑛 − 2 = 8 degrees of freedom. For a two-sided test at 5% significance level, the rejection region is:

RR: 𝑡 > 2.306 or 𝑡 < −2.306

Since our calculated t-statistic falls in the rejection region, we can reject 𝐻₀: 𝛽₁ = 0 against the alternative 𝐻_𝑎: 𝛽₁ ≠ 0 at a 5% level of significance. In words, we say that distance is a “significant determinant” of sales, meaning that we can conclude that the true coefficient on distance is different from zero.

There is an easier way to see this, though. By using the Excel output, we can immediately see that 𝑝 = 0.0268 for the hypothesis test described above. Thus, we can reject the null hypothesis at a 5% significance level, but we could not reject it at a 1% significance level. Remember that the p-value is the lowest level of significance at which the null hypothesis can be rejected.
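
For readers without the Excel output at hand, the test can be reproduced with a short script. This is a sketch, not how Excel computes the value: the p-value is obtained by numerically integrating the t density with Simpson's rule, using only the Python standard library. The coefficient, standard error, sample size and critical value are taken from the text; the function names are hypothetical.

```python
import math


def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, steps=2000):
    """P(|T| > |t_stat|), via Simpson's rule on the density over [-|t|, |t|]."""
    a = abs(t_stat)
    if a == 0:
        return 1.0
    h = 2 * a / steps  # steps must be even for Simpson's rule
    total = t_pdf(-a, df) + t_pdf(a, df)
    for i in range(1, steps):
        weight = 4 if i % 2 == 1 else 2
        total += weight * t_pdf(-a + i * h, df)
    central_area = total * h / 3   # probability that T lies in [-|t|, |t|]
    return 1 - central_area


beta1_hat = -4.2543     # slope estimate from the text
se_beta1 = 1.5717       # its standard error from the text
n = 10
df = n - 2              # single-variable regression: n - 2 degrees of freedom

t_stat = beta1_hat / se_beta1
p_value = two_sided_p_value(t_stat, df)
critical_value = 2.306  # t_{0.025} with 8 degrees of freedom, from the text
reject_at_5 = abs(t_stat) > critical_value

print(f"t = {t_stat:.2f}, p = {p_value:.4f}, reject at 5%: {reject_at_5}")
```

The p-value should agree with the Excel figure quoted above to about three decimal places; small differences can arise because the inputs here are rounded to four decimals.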

We can also construct a confidence interval for the true coefficient value. A 1 − 𝛼 confidence interval for the true 𝛽₁ is given by:

$$\left[\hat{\beta}_1 - t_{\alpha/2} \cdot se(\hat{\beta}_1),\ \hat{\beta}_1 + t_{\alpha/2} \cdot se(\hat{\beta}_1)\right]$$

Again, the relevant t-distribution is with 𝑛 − 2 degrees of freedom. For this example, a 95% confidence interval is:

$$[-4.2543 - 2.306 \cdot 1.5717,\ -4.2543 + 2.306 \cdot 1.5717] = [-7.879, -0.630]$$

Notice in the final two columns of the last line that Excel automatically calculates a 95% confidence interval for the coefficient. Since this confidence interval contains only negative values, we can be fairly sure that distance has a negative effect on sales.
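
The interval can be wrapped in a small reusable helper. The sketch below is illustrative only: `confidence_interval` is a hypothetical function name, and the critical value 2.306 is the 𝑡 value for 8 degrees of freedom used earlier in the text rather than something the code looks up.

```python
def confidence_interval(estimate, std_error, t_critical):
    """Return (lower, upper) bounds: estimate -/+ t_critical * std_error."""
    margin = t_critical * std_error
    return estimate - margin, estimate + margin


# Values from the sales-and-distance example in the text.
lower, upper = confidence_interval(estimate=-4.2543, std_error=1.5717, t_critical=2.306)
contains_zero = lower <= 0 <= upper

print(f"95% CI for the slope: [{lower:.4f}, {upper:.4f}]")
print(f"interval contains zero: {contains_zero}")
```

An interval that excludes zero corresponds to rejecting 𝐻₀: 𝛽₁ = 0 in the two-sided test at the matching significance level.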

It is important to note again that, for small sample sizes, the accuracy of these tests depends critically on the assumption of normality. For large sample sizes, the central limit theorem guarantees that the distribution of the sample coefficient will be approximately normal, but the tests for small sample sizes are only accurate when the true data-generating process is normal.

There is one final important point about interpretation. A linear regression estimates the numerical association between 𝑥 and 𝑦. The slope coefficient tells us how changes in 𝑥 are associated with changes in 𝑦. This is very different from saying that changes in 𝑥 cause changes in 𝑦. Indeed, it is very wrong to interpret a significant regression coefficient by saying that a 1-unit change in 𝑥 causes a certain change in 𝑦. Here are a few examples.

• Suppose you are interested in the effect of adding more police on a city’s crime rate. Letting 𝑦 measure the crime rate and letting 𝑥 measure the number of police, you may be surprised to find that the coefficient is positive, not negative. Does this mean that increases in police presence cause the crime rate to increase? Certainly not. What is happening is that, in places with higher crime rates, politicians hire more police. All the regression tells us is that when 𝑥 is high, 𝑦 is high at the same time. This is very different from showing that 𝑥 causes 𝑦.

• When you regress wages on education, you find a positive coefficient – higher education is associated with higher wages. Does this mean that being more educated causes you to earn a higher wage? Maybe. On the other hand, it’s also true that more educated people are more intelligent and come from wealthier families, on average. So maybe it’s not that education is causing people to earn higher wages, but just that more educated people get higher wages because they also tend to be smarter and come from wealthier families, and these things fetch higher wages.

EXERCISES

1. A researcher uses a sample of 100 classrooms with data on class size (CS) and average test scores (T) to estimate the following regression. Standard errors appear in parentheses below the estimated coefficient.

𝑇̂ = 520.4 − 5.82 ⋅ 𝐶𝑆

(20.4) (2.21)

a. Construct a 95% confidence interval for the true slope 𝛽₁.

b. Conduct a hypothesis test for the null 𝛽₁ = 0 against the standard two-sided alternative. Do you reject the null at a 5% level of significance?

c. Compute the p-value for your test in (b).

d. Conduct a hypothesis test for the null 𝛽1 = −5.6 against the standard two-sided alternative. Do you reject the null at a 5% level of significance?

e. Compute the p-value for your test in (d).

f. Is −5.6 contained in the 95% confidence interval for 𝛽1? Is 0? Relate this to your answers in (b) and (d).

2. A researcher estimates the following regression of wage (W) on years of education (E) and obtains the following results. Standard errors appear in parentheses below the estimated coefficient.

𝑊̂ = −3.13 + 1.48 ⋅ 𝐸

(0.93) (0.07)

a. A randomly selected worker reports 16 years of education. What is his expected wage?

b. A high school graduate is contemplating whether to obtain a four-year degree. How much is his wage expected to rise?

8.3: Multiple-Variable OLS Regression

The setup in this case is similar to the single-variable case, except that now the dependent variable 𝑦 is a function of many independent variables.

𝑦_𝑖 = 𝛽₀+ 𝛽₁𝑥_1𝑖+ 𝛽₂𝑥_2𝑖+ ⋯ + 𝛽_𝑘𝑥_𝑘𝑖+ 𝑢_𝑖

The important difference here is the following. Suppose that 𝑦 is wage, while 𝑥₁ is education and 𝑥₂ is IQ. In this case, 𝛽₁ is the increase in wage associated with a 1-unit increase in education while holding all other variables constant. This is important, because changes in one variable might be associated with changes in another variable. For this example, 𝛽₁ is the increase in wage associated with one more year of education, while holding IQ constant. In other words, for two people with the same IQ, what is the estimated increase in wage from a one year increase in education?

In this case, we say that we are “controlling” for IQ, meaning that we are looking at the effect of differences in education on wages while holding IQ constant. This is important because, in the single-variable regression of wage on education (without IQ), the coefficient may be too high because it is picking up not only the effect of education but also is picking up the effect of IQ since people with higher IQ typically have more education. By controlling for IQ, we isolate the effect of education by looking at the estimated impact of changes in education, but while holding IQ constant.

As in the single variable case, our goal is to develop an estimated regression line:

𝑦̂_𝑖 = 𝛽̂₀+ 𝛽̂₁𝑥_1𝑖+ 𝛽̂₂𝑥_2𝑖+ ⋯ + 𝛽̂_𝑘𝑥_𝑘𝑖

The residual 𝑢̂_𝑖 is the difference between the fitted value 𝑦̂_𝑖 predicted by the regression line and the actual value of 𝑦𝑖.

𝑢̂_𝑖 = 𝑦_𝑖 − 𝑦̂_𝑖

As before, unless the relationship between 𝑦 and {𝑥₁, 𝑥₂, … , 𝑥_𝑘} in the data is perfectly linear, there will always be some residual. The OLS estimates are chosen to minimize the sum of these squared residuals:

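$$SSR = \sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1i} - \hat{\beta}_2 x_{2i} - \cdots - \hat{\beta}_k x_{ki}\right)^2$$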

Unfortunately, unlike in the single-variable case, there is no simple formula for expressing the coefficients {𝛽̂₀, 𝛽̂₁, … , 𝛽̂_𝑘} that solve this problem (it requires matrix notation). Luckily, the computer can find the coefficients for us.
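
To give a feel for what the computer does, here is a minimal sketch that finds the coefficients by solving the so-called normal equations (X'X)b = X'y with Gaussian elimination. This is one textbook approach and an assumption on my part: it is not necessarily the method Excel uses, and real statistics software generally relies on more numerically stable routines. The data are made up so that wage is exactly 1 + 2·educ + 0.5·exper, which lets the output be checked by eye.

```python
def ols_multiple(rows, y):
    """OLS coefficients [b0, b1, ..., bk] for y regressed on the columns of rows.

    rows is a list of observations, each a list of k regressor values.
    Solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination.
    """
    X = [[1.0] + [float(v) for v in row] for row in rows]  # prepend intercept column
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(obs[i] * obs[j] for obs in X) for j in range(p)]
         + [sum(obs[i] * yi for obs, yi in zip(X, y))]
         for i in range(p)]
    for col in range(p):
        # Partial pivoting: bring the largest remaining entry into the pivot row.
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("regressors are perfectly collinear")
        A[col], A[pivot] = A[pivot], A[col]
        # Eliminate this column from every other row.
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Hypothetical data: [education, experience] for five workers.
rows = [[12, 5], [16, 2], [10, 20], [14, 8], [18, 1]]
y = [1 + 2 * educ + 0.5 * exper for educ, exper in rows]

coefs = ols_multiple(rows, y)
print([round(c, 6) for c in coefs])  # intercept, education, experience
```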

We can compute 𝑅² in the same way as for a single-variable regression:

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{SSR}{TSS}$$

The interpretation is the same as in the single-variable case. 𝑅² is the fraction of the variation in 𝑦 that is explained by the independent variables {𝑥₁, 𝑥₂, … , 𝑥_𝑘}.

In multiple-variable models, researchers often are interested in “model selection”. That is, we want to know which variables are important, and thus which to include in the regression. The basic answer is that the specification should be guided by some kind of theory. Nevertheless, it can sometimes be helpful to determine which among multiple possible models is the “best fit”.

One weakness with 𝑅² in the model selection context is that 𝑅² never decreases, and almost always increases, when new regressors are added to the model. That is, if you run a regression of 𝑦 on 𝑚 regressors, and then you keep those same 𝑚 regressors but add one more as well, 𝑅² will not fall. This is undesirable because we don’t like to add variables to our model unless they have a reasonable amount of explanatory power.

In other words, if you use 𝑅² to select among models, it will always select the model with more variables, regardless of whether these additional variables are actually important. One solution to this problem is adjusted 𝑅², or 𝑅̅², which is defined as follows:

$$\bar{R}^2 = 1 - \left(\frac{n-1}{n-k-1}\right)\left(\frac{SSR}{TSS}\right)$$

The inclusion of 𝑘 in this formula is in essence a penalty for adding new variables. If a new variable is added that does not have very much explanatory power, then 𝑘 rises but 𝑆𝑆𝑅 only falls a little bit, so 𝑅̅² will actually fall. On the other hand, when a new variable is added that is important and has a lot of explanatory power, then 𝑘 rises but 𝑆𝑆𝑅 falls by a large amount. Thus, 𝑅̅² will rise in this case. Basically, when a new variable is added, 𝑅̅² will only rise if the new variable has “enough” explanatory power. Thus, 𝑅̅² is a more reliable criterion for model selection when deciding whether to add additional variables.
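
The penalty can be seen with a few numbers. In the sketch below, all values of 𝑛, 𝑇𝑆𝑆 and 𝑆𝑆𝑅 are hypothetical and chosen only to illustrate the two cases described above; the function names are likewise assumptions.

```python
def r_squared(ssr, tss):
    """Ordinary R^2 = 1 - SSR/TSS."""
    return 1 - ssr / tss


def adjusted_r_squared(ssr, tss, n, k):
    """Adjusted R^2 = 1 - ((n - 1) / (n - k - 1)) * (SSR / TSS)."""
    return 1 - ((n - 1) / (n - k - 1)) * (ssr / tss)


n, tss = 30, 1000.0          # hypothetical sample size and total sum of squares

# (label, SSR, number of regressors k): all hypothetical values.
models = [
    ("base model, k = 2", 400.0, 2),
    ("add a weak variable, k = 3", 398.0, 3),     # SSR barely falls
    ("add a strong variable, k = 3", 300.0, 3),   # SSR falls a lot
]

results = {}
for label, ssr, k in models:
    results[label] = (r_squared(ssr, tss), adjusted_r_squared(ssr, tss, n, k))
    print(f"{label}: R^2 = {results[label][0]:.4f}, adjusted R^2 = {results[label][1]:.4f}")
```

With these numbers, adding the weak variable raises 𝑅² slightly but lowers adjusted 𝑅², while adding the strong variable raises both.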

Recall from unit 8.1, in the single-variable regression of wages on education we obtained:

𝑤𝑎𝑔𝑒̂ = −1.6047 + 0.8139 ⋅ 𝑒𝑑𝑢𝑐

Reading from the table above, our new results are:

𝑤𝑎𝑔𝑒̂ = −5.5637 + 0.9769 ⋅ 𝑒𝑑𝑢𝑐 + 0.1037 ⋅ 𝑒𝑥𝑝𝑒𝑟

What is going on here? The correlation between education and experience is negative since older workers tend to have fewer years of education on average. In general, young people today tend to go to school for longer than their counterparts 30 or 40 years ago. Thus, when wage is regressed only on education, part of the problem is that the workers with more education also tend to be the younger and less experienced workers, which causes their wage to drop. Once we control for experience, then the estimated effect of education on wages is stronger. In other words, in the second regression, the coefficient 0.9769 tells us that each additional year of education is associated with an increase in wage of 0.9769, given two workers with the same experience levels. But in the first regression, the coefficient of 0.8139 tells us that each additional year of education is associated with a wage increase only of 0.8139. But this is not controlling for experience – these more educated workers are probably the less experienced workers as well.

𝑎𝑏𝑢𝑠𝑒̂ = 𝑐 + 1.1550 ⋅ 𝑢𝑛𝑒𝑚𝑝𝑙𝑜𝑦𝑚𝑒𝑛𝑡 + 0.7105 ⋅ 𝑝𝑜𝑣𝑒𝑟𝑡𝑦

The difficulty with interpreting this regression is that rich areas overall have better social services and more money to spend on child welfare programs. So maybe it’s not that unemployment and poverty cause child abuse to be high, but rather that these areas are the areas with poor social services, and this is the reason that child abuse rates are higher in these areas.

Indeed, if we control for spending on child welfare social services, we obtain the following:

𝑎𝑏𝑢𝑠𝑒̂ = 𝑐 + 0.5146 ⋅ 𝑢𝑛𝑒𝑚𝑝𝑙𝑜𝑦𝑚𝑒𝑛𝑡 + 0.5925 ⋅ 𝑝𝑜𝑣𝑒𝑟𝑡𝑦 − 0.0175 ⋅ 𝑠𝑝𝑒𝑛𝑑𝑖𝑛𝑔

As we guessed, the estimated effect of economic circumstances on child abuse drops substantially once we control for spending on child welfare programs. In other words, the effect of economic circumstances on child abuse rates is not as strong as it would have appeared at first glance. Once we realize that poor areas are also areas with bad social support programs, we realize that part of this effect is weak social support systems, not being poor per se.

The coefficient on gender shows that a change in GENDER from 0 to 1 is associated with a 2.32 decline in wage. That is, holding constant the level of education and experience, the expected hourly wage for a woman is $2.32 lower than the expected wage for a man.

Note that if we had coded the variable instead as 0 for female and 1 for male, then the coefficient would have been +2.32 rather than −2.32. In other words, holding education and experience constant, being male is associated with earning a wage that is $2.32 higher.

Does this prove that there is gender discrimination in wages? Recalling the previous section, correlation is a very different thing from causation. What we have shown is that, at least in this data, it is conclusive that women earn less than men with the same education and experience levels. But there are other things – maybe women hold different kinds of jobs on average, for example. Showing a correlation between gender and wage does not prove causation.

For dummy variables that have more than two classifications, there is one warning. For example, suppose that you are looking at a company’s sales on different days and you want to include an independent control variable for the season – maybe the demand is seasonal, so it is important to control for the season during which the observation is taken. This is a dummy variable with four levels since the observation could be taken either during the fall, winter, spring or summer. In this case, your first thought might be to create four variables:

𝑥₁ = 1 if observation is in the fall, 0 otherwise

𝑥2 = 1 if observation is in the winter, 0 otherwise

𝑥3 = 1 if observation is in the spring, 0 otherwise

𝑥₄ = 1 if observation is in the summer, 0 otherwise

EXERCISES

1. Data were collected from a random sample of 220 home sales from a community in 2003. Let 𝑃 denote the selling price (in thousands of US dollars). 𝐵𝐷 denotes the number of bedrooms in the house. 𝐵𝐴 denotes the number of bathrooms. 𝐻𝑆 denotes the size of the house (in square feet). 𝐿𝑆 denotes the size of the lot (in square feet). 𝐴 denotes the age of the house (in years). 𝑃𝑅 is a dummy variable equal to 1 if the house is in poor condition. The estimated regression is:

𝑃̂ = 119 + 0.485 ⋅ 𝐵𝐷 + 23.4 ⋅ 𝐵𝐴 + 0.146 ⋅ 𝐻𝑆 + 0.002 ⋅ 𝐿𝑆 + 0.090 ⋅ 𝐴 − 48.8 ⋅ 𝑃𝑅

a. A homeowner converts part of an existing living room into a new bathroom. What is the expected increase in the value of the house?

b. A homeowner adds a new bathroom by putting on an extension that adds 100 square feet to the house. What is the expected increase in the value of the house?

c. A homeowner converts a bedroom into a bathroom. What is the expected increase in the value of the house?

d. What happens to the value of the house if the homeowner lets the condition become poor?

8.4: Inference in multiple-variable OLS regression

Hypothesis tests and confidence intervals related to regression coefficients are operationalized in the same way as in the single-variable case. The Excel output reports the standard error for each coefficient. As an example, suppose that our company with 10 sales locations is interested in the determinants of sales. The researcher regresses the location’s average sales on three factors: the price charged for the product, the level of advertising and the distance between the sales location and the city center. The following are the results.

The estimated regression line is:

𝑆𝑎𝑙𝑒𝑠̂ = 135 − 0.14 ⋅ 𝑃𝑟𝑖𝑐𝑒 + 0.54 ⋅ 𝐴𝑑𝑣𝑒𝑟𝑡𝑖𝑠𝑖𝑛𝑔 − 5.78 ⋅ 𝐷𝑖𝑠𝑡𝑎𝑛𝑐𝑒

The researcher is now interested in determining which of these variables are significant determinants of sales.

In this example, note that the p-value for advertising is 𝑝 = 0.43. This means that, at conventional levels of significance like 1% or 5%, we are not able to reject the null hypothesis that the coefficient on advertising is equal to zero. That is, this regression does not provide evidence that advertising has a statistically significant effect on sales.

By contrast, the p-value for distance is 𝑝 = 0.003. Since this is lower than our conventional significance levels like 1% or 5%, this means that we can reject the null hypothesis that the coefficient on distance is equal to zero. In other words, the researcher has good evidence that distance from the city center does have a significant effect on sales.

In the case of the price coefficient, the p-value is 𝑝 = 0.059. This is above 5% but less than 10%. Thus, if we are interested in testing whether price has a statistically significant effect on sales, this relationship is statistically significant at a 10% level of significance, but not at a 5% level of significance.

To see where these p-values come from, let us analyze the coefficient on price in a bit more detail. Recall that the hypothesis test is testing whether the true coefficient is equal to zero.

𝐻₀: 𝛽_𝑖 = 0 versus 𝐻_𝑎: 𝛽_𝑖 ≠ 0

The t-statistic for this hypothesis test is the ratio of the coefficient value to the standard error:

$$t = \frac{-0.1431}{0.0594} = -2.41$$

Since tests for regression coefficients are conventionally two sided, the rejection region for a test with a size of 5% is:

$$t > t_{0.025} \quad \text{or} \quad t < -t_{0.025}$$

Now, the appropriate degrees of freedom for this test is 𝑛 − 𝑘 − 1. The reason is that we are estimating 𝑘 + 1 parameters – 𝑘 slope terms and also the intercept, and we lose one degree of freedom for each parameter being estimated. In this case, there are three independent variables and our sample size is 𝑛 = 10, so the appropriate degrees of freedom is 10 − 3 − 1 = 6. Reading from the t-table shows that the corresponding rejection region is:

𝑡 > 2.447 or 𝑡 < −2.447

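Our t-statistic of −2.41 does not fall into this rejection region, so we cannot reject the null hypothesis at a 5% level of significance. For a test with a size of 10%, the critical value is 𝑡₀.₀₅ with 6 degrees of freedom, and the rejection region is: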

𝑡 > 1.943 or 𝑡 < −1.943

In this case, our t-statistic does fall into the rejection region, so we are able to reject the null hypothesis that the coefficient on price is equal to zero. Precisely, the computer reports the p-value 𝑝 = 0.059, which confirms what we just showed that the null hypothesis can be rejected at a 10% level of significance but not at a 5% level of significance.

The 95% confidence interval for the true coefficient is computed in the normal way:

$$\left[\hat{\beta}_i - t_{\alpha/2} \cdot se(\hat{\beta}_i),\ \hat{\beta}_i + t_{\alpha/2} \cdot se(\hat{\beta}_i)\right]$$

Thus, for a 95% confidence interval for the coefficient on price, we have:

[−0.1431 − 2.447 ⋅ 0.0594, −0.1431 + 2.447 ⋅ 0.0594] = [−0.288, 0.002]

Intuitively, since this confidence interval contains both positive and negative values, we were not able to reject at a 5% level of significance that the true coefficient is 0 – in other words, that price has no effect on sales. The coefficient value 0 is contained within our 95% confidence interval.
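
The whole analysis of the price coefficient fits in a few lines. In this sketch the estimate, standard error, sample size and critical values are those quoted in the text; the variable names and the dictionary of critical values are illustrative choices.

```python
beta_hat = -0.1431   # coefficient on price, from the text
std_error = 0.0594   # its standard error, from the text
n, k = 10, 3         # sample size and number of independent variables

df = n - k - 1       # degrees of freedom: one lost per estimated parameter
t_stat = beta_hat / std_error

# Two-sided critical values for 6 degrees of freedom, as read from the t-table in the text.
critical_values = {0.05: 2.447, 0.10: 1.943}
reject = {alpha: abs(t_stat) > crit for alpha, crit in critical_values.items()}

# 95% confidence interval uses the 5% two-sided critical value.
margin = critical_values[0.05] * std_error
ci_low, ci_high = beta_hat - margin, beta_hat + margin

print(f"df = {df}, t = {t_stat:.2f}")
for alpha, decision in reject.items():
    print(f"reject H0 at {alpha:.0%}: {decision}")
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

The decisions line up with the interval: zero lies inside the 95% interval exactly when the two-sided test fails to reject at 5%.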

There are two other pieces of information in the regression output that you should understand. The standard error of the regression is an estimate for the standard deviation of the true, unknown population error 𝑢_𝑖. It is calculated as:

$$SER = \sqrt{\frac{SSR}{n-k-1}}$$

Finally, the F-statistic relates to a joint test for the significance of all coefficients. Formally, the hypotheses are:

𝐻0: 𝛽1 = 𝛽2 = ⋯ = 𝛽𝑘 = 0 versus 𝐻𝑎: At least one coefficient ≠ 0

The F-statistic is calculated as:

𝐹 = (𝐸𝑆𝑆 ⁄ 𝑘) ⁄ (𝑆𝑆𝑅 ⁄ (𝑛 − 𝑘 − 1))

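When the p-value reported for the F-statistic is below the chosen level of significance, we reject the null hypothesis in favor of the alternative hypothesis that at least one coefficient is nonzero – in other words, that at least one of the independent variables is a significant determinant of sales.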
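The F-statistic formula can be sketched in Python as follows. The values of ESS and SSR below are hypothetical and chosen only so that the arithmetic is easy to follow; the function name is my own.

```python
def f_statistic(ess, ssr, n, k):
    """Return (ESS / k) / (SSR / (n - k - 1))."""
    return (ess / k) / (ssr / (n - k - 1))

# Hypothetical sums of squares for a regression with n = 10 and k = 3.
ess = 30.0   # explained sum of squares (made-up value)
ssr = 12.0   # sum of squared residuals (made-up value)

f_stat = f_statistic(ess, ssr, n=10, k=3)   # (30 / 3) / (12 / 6)
print(f"F = {f_stat:.2f}")
```

The resulting value would be compared against an F-distribution with 𝑘 and 𝑛 − 𝑘 − 1 degrees of freedom, which in practice the regression software does when it reports the p-value.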

There are a few other issues related to OLS regression that I will mention here briefly. In a more advanced course covering linear models, such as econometrics, you will deal with these issues in much more detail.

 Multicollinearity: Suppose you run a regression of wage on both age and experience. The problem is that age and experience are highly correlated with each other – most of the time, older individuals also have more years of job experience. The coefficient estimates will probably be imprecise (i.e. have high standard errors) because it is difficult to separate the effect of age from the effect of experience when the two normally increase together. What to do about multicollinearity is a matter of persistent disagreement. Some textbooks will tell you that, if two independent variables are highly correlated, one should be dropped. Most economists disagree and would tell you not to throw out important variables even if there is multicollinearity. Getting the model correct is more important than getting low standard errors.

 Omitted Variables: If a variable 𝑍 is a true determinant of 𝑌 and is correlated with other independent variables in the regression, then the coefficients on those variables will be biased if 𝑍 is not included in the regression. For example, regressing wage on education produces an estimated coefficient that is too high if one fails to control for factors such as IQ and family status. The coefficient on education in this case is “picking up” the effects of all the other things that are correlated with education and that also influence wages.
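The omitted-variable problem can be illustrated with a small simulation. This is only a sketch under assumptions of my own: the data are artificial, "ability" stands in for an omitted factor such as IQ, and the true coefficients (1.0 on education, 3.0 on ability) are chosen arbitrarily. The two-regressor slope is computed by solving the normal equations directly.

```python
import random

random.seed(0)  # fixed seed so the simulation is reproducible

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance; cov(a, a) is the sample variance."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

# Artificial population: ability raises both education and wage.
n = 5000
ability = [random.gauss(0, 1) for _ in range(n)]
educ = [12 + 2 * z + random.gauss(0, 1) for z in ability]
wage = [1.0 * x + 3.0 * z + random.gauss(0, 1)
        for x, z in zip(educ, ability)]

# Short regression: wage on education only (ability omitted).
short_slope = cov(educ, wage) / cov(educ, educ)

# Long regression: wage on education and ability, via the normal equations.
sxx, szz, sxz = cov(educ, educ), cov(ability, ability), cov(educ, ability)
sxy, szy = cov(educ, wage), cov(ability, wage)
long_slope = (sxy * szz - szy * sxz) / (sxx * szz - sxz ** 2)

print(f"education slope, ability omitted:  {short_slope:.2f}")
print(f"education slope, ability included: {long_slope:.2f}")
```

With ability omitted, the education coefficient comes out well above its true value of 1.0, because education is picking up the effect of ability. Once ability is included, the estimate returns close to 1.0.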

EXERCISES

1. A researcher is interested in determining whether students who live far away from colleges get fewer years of education on average because of transportation problems. First, he runs a simple regression of years of education on distance to the nearest college. Then he runs a regression of years of education on distance to the nearest college and many other control variables, as described below. The output for both regressions is given on the next page.

a. Interpret the coefficient on DISTANCE from the single-variable regression.

b. Interpret the coefficient on DISTANCE from the multiple-variable regression.

c. How can you explain the difference between (a) and (b)?

d. Is DISTANCE a significant determinant of education level for the single-variable regression?

e. Which variables in the multiple-variable regression are significant determinants (at significance level 𝛼 = 0.05) of education levels?

f. Compare the fit of the two regressions.

g. Interpret the value of the coefficient on DAD COLLEGE precisely in words.

h. Use economics to explain the sign of the coefficients on UNEMPLOYMENT and MAN WAGE.

i. If you take two otherwise identical students, one of whom lives 20 miles from the nearest college and the other one lives 60 miles from the nearest college, what is your best prediction about the difference in their education levels?

Variable        Description
DISTANCE        Distance from nearest college, in miles
BLACK           Dummy, =1 if black
TEST            High school achievement test score (out of 100)
UNEMPLOYMENT    City’s unemployment rate
DAD COLLEGE     Dummy, =1 if father is a college graduate
FEMALE          Dummy, =1 if female
HISPANIC        Dummy, =1 if Hispanic
HIGH INCOME     Dummy, =1 if family income is in top 25%
OWN HOME        Dummy, =1 if family owns home

2. A regression was estimated using a random sample of 4000 full-time workers. COLLEGE is a dummy variable equal to 1 if the worker graduated from college. FEMALE is a dummy variable equal to 1 if the worker is female. A researcher regresses wage on these two independent variables and obtains the following. Standard errors are given in parentheses below the coefficient values:

𝑤𝑎𝑔𝑒̂ = 12.69 + 5.46 ⋅ 𝑐𝑜𝑙𝑙𝑒𝑔𝑒 − 2.64 ⋅ 𝑓𝑒𝑚𝑎𝑙𝑒

(0.14) (0.21) (0.20)

a. Interpret the two slope coefficients precisely in words.

b. Is the wage premium from graduating from college significantly different from 0 at a 5% level of significance?

c. Is the male-female earnings difference significantly different from 0 at a 5% level of significance?

3. Suppose now that the researcher adds age and geographical location to the regression above. There are four geographical regions in the US: NEAST (Northeast), MWEST (Midwest), SOUTH and WEST. The regression results are as follows:

𝑤𝑎𝑔𝑒̂ = 3.75 + 5.44 ⋅ 𝑐𝑜𝑙𝑙𝑒𝑔𝑒 − 2.62 ⋅ 𝑓𝑒𝑚𝑎𝑙𝑒 + 0.29 ⋅ 𝑎𝑔𝑒 +0.69 ⋅ 𝑁𝐸𝐴𝑆𝑇 + 0.60 ⋅ 𝑀𝑊𝐸𝑆𝑇 − 0.27 ⋅ 𝑆𝑂𝑈𝑇𝐻

a. Why was WEST omitted from the regression?

b. Kelly is a 28-year-old female college graduate from the South. Jennifer is a 28-year-old female college graduate from the West. Calculate the expected difference in their wages.