Chapter 5: Basic Linear Regression


1. Why Regression Analysis Has Dominated Econometrics

Until now we have focused on forming estimates and tests for fairly simple cases involving only one variable at a time. But the core task of the human sciences is to study the simultaneous interrelationships among several variables. How will an increase in price affect quantity demanded? How will law enforcement affect deviant behavior? How will a change in the Federal deficit affect inflation? These are all questions about the effect of one variable upon another. The only tool we’ve discussed for such questions is correlation, and we’ve seen that it has serious drawbacks.

If we were doing an experimental natural science, we might solve this problem by conducting controlled experiments, in which we only change one variable at a time. This might isolate the relationships among variables without needing too much statistical artillery. But usually this isn’t an option in the human sciences. So we must develop a statistical “virtual laboratory,” a means of coordinating our data so that we can draw conclusions as if the data had been generated by a controlled experiment, teasing out the unique effect of each variable. And regression analysis is the most common approach used by economists to construct such a virtual laboratory to explore simultaneous relationships among several variables. We’ll start with the simplest case: one variable affecting one other variable. In the next chapter we consider the more realistic case, where several independent variables affect a dependent variable.

2. The Basic Regression Paradigm

Let’s say that I lead you out to the western sideline of the college soccer field tomorrow morning at 9:00 a.m. with the following instructions:

Last evening at 10:00 p.m. I buried a tiny 1cm canister under 1/2cm of soil, somewhere along the opposite edge of this field. The canister contains the entire college endowment--$25,000,000. It’s yours if you walk directly to it and pick it up on your first try.

As I buried the money, I marked the path from this side of the field to the canister with a narrow piece of double-sided tape. Unfortunately, this tape completely disintegrates 1 hour after use. But, fortunately for you, the tape was covered with the eggs of a large Amazonian flea. These fleas hatch, take one leap, and die.

Your mission, therefore, is to reconstruct the position of the tape, using the flea carcasses as your guide.

Why would I do this? OK, I’m a little eccentric sometimes. But this exercise would exactly parallel the duties of regression analysis:

We believe, for theoretical reasons, that there is some relationship in the world between two variables—say price and quantity demanded at a Farmers’ Market booth on a Saturday morning:

(Notice that I’ve put the dependent variable on the vertical axis. Looks like tape on a field, no? )

Unfortunately, the intercept and slope of the line that measures this relationship are population parameters—we can not directly observe them. Instead we observe samples from this population, and


(This kind of picture is called a “scatter plot” or “sample scatter diagram.”)

There will be “random errors” that move our observations away from the regression line, because some things we’ve left out of our model also affect quantity demanded (like changes in income), the population relationship may not be perfectly linear (as we’re assuming here), we may make some errors in measuring the two variables, and there may simply be a truly random, nondeterministic component to demand.

In any event, the thing we observe is a cloud of measurements around the population regression line, not the line itself.

From this cloud of data, our job is to make the best guess about where the invisible population line actually lies:

How to proceed? The majority of econometric work approaches this in typical economist fashion: Let’s reduce this problem to a simpler one by making several simplifying assumptions. Then in the succeeding chapters we will learn how to discover whether each assumption is actually justified, identify the problems that arise when each assumption is violated, and try to construct a fix for these problems. It will be a bit like practicing medicine: Whenever you do regression analysis, you’ll fall into the pattern of diagnosing violations of assumptions, recognizing the symptoms of these violations, prescribing a treatment, and monitoring the results.

That’s the most common approach to doing econometrics. We should mention that there are several less-common approaches to the basic problem of inferring population relationships from sample observations— non-parametric estimation, vector-autoregressive models, and others. These are usually considered beyond the scope of an undergraduate course (and even many graduate courses) but might make good term-paper topics for the right person.


If we were out on the soccer field trying to reconstruct the position of the double-sided tape, we’d have to make some assumptions about how the tape was placed and how Amazonian fleas jump when they hatch. The same is true of reconstructing a population regression line from our sample observations. Eight basic assumptions have proven useful:

Assume that the relationship between the variables is linear. 5.1.1

We’ll find easy ways to relax this assumption, but it’s a helpful place to start. In symbols, we’re assuming that the population relationship looks like this:

$$Y = \alpha + \beta X + \varepsilon \qquad 5.1.2$$

That is, the dependent variable is a linear function of an independent variable, plus-or-minus a random error that we are calling $\varepsilon$ (the Greek lower-case epsilon). $\alpha$ and $\beta$ are known as the “regression coefficients.” $\beta$ represents the marginal effect of $X$ upon $Y$, the slope of the regression line. $\alpha$ represents the intercept of the line, which incorporates the effect upon $Y$ of all the variables that influence $Y$ but do not appear in our equation. (For a demand curve, these would include income and population size.) If any of these absent variables were to change, $\alpha$ would also change, and the entire line would shift.

To be a bit picky, Equation 5.1.2 is a reference to the entire population relationship, the whole line. We’d denote any particular occurrence of the dependent variable along this line with lower-case letters bearing subscripts,

$$y_n = \alpha + \beta x_n + \varepsilon_n, \qquad 5.1.3$$

where the lower-case $n$ subscript indicates that this particular occurrence of $Y$ is one of the (upper-case) $N$ occurrences we’ve measured. In pictures:

Remember: It’s usually lower-case letters for references to individual observations, upper-case letters for references to totalities.

Of course we never actually observe the $\alpha$ or the $\beta$ because they’re population parameters, so we can never actually know the population line’s position or measure the $\varepsilon_n$ (the distance of our observation from the population regression line). Instead we will eventually make estimates of these items, constructing a sample regression line that we’ll denote

$$Y = \hat{\alpha} + \hat{\beta} X + \hat{\varepsilon} \qquad 5.1.4$$

So you might say that the basic regression problem is to calculate estimates $\hat{\alpha}$, $\hat{\beta}$ and $\hat{\varepsilon}$ for the population parameters $\alpha$, $\beta$ and $\varepsilon$. We’ll also want to calculate the standard errors of these three estimators, so we can conduct hypothesis tests and build confidence intervals.


Now that we’ve explained the first major assumption and clarified notation, the remaining seven assumptions will go more quickly.

Assume that the error term is a random variable with a mean of zero:

$$E(\varepsilon_n) = 0. \qquad 5.2.1$$

This assumes that, on average, the error terms cancel each other out: some positive (placing our observations above the true regression line), some negative, in roughly equal numbers and distances. In other words, the expected value of $y$ depends upon the value of $x$, and this conditional mean of $y$ is equal to $\alpha + \beta x$, given a value of $x$.

On the soccer field, this is like assuming that the fleas jump, in roughly equal numbers, toward either goal post. If you assumed this but they were instead magnetized and all jumped to the north, your estimate of the population line would be mistaken. You’d dig your hole a bit north of the treasure.

Assume that X is not a random variable; it is measured without error. 5.3.1

Our sample observations don’t lie precisely on top of the population regression line because of random errors, and now we make another limiting assumption about those errors: They all pertain to the measurement of Y only, not the measurement of X. In our graphs, this means that Price is measured with certainty, but the response of Quantity to Price is somewhat random. In effect, we assume that the sample data have jumped vertically away from the regression line before we could measure them. If this is true, it simplifies the process of finding the best estimate of the regression line: Just find a line that minimizes the vertical distances between your data and your line. This assumption also has a pleasant implication about the covariance between the error term and the independent variable:

$$\mathrm{Cov}(x_n, \varepsilon_n) = E(x_n \varepsilon_n) - E(x_n)E(\varepsilon_n) = x_n E(\varepsilon_n) - x_n E(\varepsilon_n) = 0 \qquad 5.3.2$$

In words: The covariance between our independent variable and the error term is zero; they are not correlated.

Equation 5.3.2 will lead to some nice properties for the OLS estimators of the regression line. Maybe you can picture them now: Imagine how hard it would be to find the treasure if the direction of the fleas’ jumps (the error term) were correlated with the independent variable (the distance you’ve walked across the field searching for treasure). If there were, say, a positive correlation, then the closer you get to the treasure (that’s increasing the independent variable in our picture), the more the fleas tend to jump above the true regression line, not below it; they would lead you to the left of where the treasure lies, and you’d end up with a biased estimate of the true regression line.

By the way, this assumption isn’t quite the same thing as assuming that changes in X cause changes in Y, though it’s close. Strictly speaking, regression analysis only indicates an association between two variables, not a cause-and-effect relationship between them. But, though there are formal tests of “causality” between two variables, the assumption that only one variable experiences random error nearly forces us to think of that variable (Y) as somehow responding to changes in the other variable (X). The next chapter will consider cases in which more than one independent variable affects a single dependent variable (Y). Later in the course we will encounter “simultaneous equations” models, which allow for more than one dependent variable.

Assume that all random errors are identically distributed (constant variance): 5.4.1

$$\mathrm{Var}(\varepsilon_n) = \sigma^2 \quad \text{for all } n$$

This property is called “homoscedasticity,” meaning “equal scatter.” If instead, for example, the variance of the error term increased as X increased, our data would look something like this:


Assume that all random errors are independently distributed: 5.5.1

$$\mathrm{Cov}(\varepsilon_n, \varepsilon_s) = 0 \quad \text{for all } n \neq s$$

This property is called “serial independence,” meaning that the distribution of any one error term doesn't depend on the random errors elsewhere in the series of data. This assumption is most likely to be violated in time-series data, where one observation may be influenced by preceding observations. Consider national GDP data: If one quarter’s GDP is well below potential GDP, it’s likely that the next quarter will also be below potential:

Taken together, 5.2.1, 5.4.1 and 5.5.1 imply that the random errors are identically and independently distributed (i.i.d.) random variables with a mean of zero and fixed, finite variance. One more assumption about the error terms will prove helpful:

Assume that the random errors are normally distributed. 5.6.1

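This assumption will allow us to get started in testing hypotheses and forming confidence intervals. Combining these first six assumptions, we can summarize the basic linear regression model:

$$Y = \alpha + \beta X + \varepsilon, \quad \text{where } \varepsilon \sim N(0, \sigma^2),$$

$$\mathrm{Cov}(x_n, \varepsilon_n) = 0 \quad \text{for all } n,$$

$$\mathrm{Cov}(\varepsilon_n, \varepsilon_s) = 0 \quad \text{for all } n \neq s.$$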

Folks sometimes summarize these assumptions by saying that they’ve assumed that the random error term is “well behaved.”
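To make a “well behaved” error term concrete, here is a minimal Python sketch that generates data the way the model assumes: the x values are fixed, and each error is an independent draw from a normal distribution with mean zero and constant variance. The function name `simulate_sample` and the parameter values (intercept 400, slope -5, error standard deviation 8) are invented for illustration; they are not estimates from any real data.

```python
import random

def simulate_sample(alpha, beta, sigma, x_values, seed=0):
    """Draw one sample from y_n = alpha + beta * x_n + eps_n.

    The x values are treated as fixed (non-random). Each eps_n is an
    independent draw from a normal distribution with mean 0 and
    standard deviation sigma, so the errors are i.i.d. by construction.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    errors = [rng.gauss(0.0, sigma) for _ in x_values]
    y_values = [alpha + beta * x + e for x, e in zip(x_values, errors)]
    return y_values, errors

# Hypothetical demand curve: 20 price levels, each observed 500 times.
prices = [float(p) for p in range(1, 21)] * 500
quantities, errors = simulate_sample(400.0, -5.0, 8.0, prices)

# With 10,000 draws the average error should sit close to zero.
mean_error = sum(errors) / len(errors)
print(f"average error: {mean_error:.3f}")
```

Every quantity in the simulated cloud differs from the population line only by its own vertical error, which is the situation the first six assumptions describe.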

Those are all the major assumptions of the model per se, but we must make two additional assumptions about the data we insert into this model:

The observed values of $x_n$ must not all be identical. 5.7.1

If they were all identical, our data would look like this,

and any number of straight lines would fit those observations equally well:

Hence it would be impossible to identify the best estimates of $\alpha$ and $\beta$.

Stated differently, this assumption implies that the sample variance of the independent variable is not zero:

$$\frac{\sum_n (x_n - \bar{x})^2}{N - 1} \neq 0 \qquad 5.7.2$$

The number of model parameters must not be larger than the number of observations (N). 5.8.1

In the simple linear case, we have two model parameters to estimate: $\alpha$ and $\beta$. We must have at least two sample observations in order to estimate them. Two points determine a line, and any more points are gravy. But try fitting a line to a single point. Any line through that point will work, so it again becomes impossible to identify the best estimates of $\alpha$ and $\beta$.

There you have the eight classical assumptions of simple regression analysis. It’s time to think about how to actually calculate estimates of $\alpha$ and $\beta$ under these assumptions.

4. Estimating the Regression Population Parameters: Ordinary Least Squares

As was suggested while discussing Assumption 5.3.1, we could find parameter estimates by searching for the line that minimizes the vertical distances between this line and the data. OLS estimation is the most common way of doing this, in which we minimize the squared vertical distances between the data and the estimated regression line. Why square the distances? This converts all errors to positive numbers (which makes the computations more straightforward), and amplifies the influence of “outliers,” observations that lie farther from the core of our observations. (You might think it’s not a good idea to exaggerate these least-typical observations, and we will eventually encounter other ways to estimate the regression parameters.) It also just happens that the OLS estimators have some desirable properties, which we explore in Section 5 of this chapter.

Let’s derive the OLS estimators:

Our basic linear model holds that each data point can be described (from 5.1.4) as

$$y_n = \hat{\alpha} + \hat{\beta} x_n + \hat{\varepsilon}_n.$$

Since we want to minimize something involving the errors, let’s solve this equation for the estimated error (or “residual”) term:

$$\hat{\varepsilon}_n = y_n - (\hat{\alpha} + \hat{\beta} x_n). \qquad 5.9$$

Now square this residual, then sum across all observations, giving us the sum of squared errors (also called “error sum of squares,” abbreviated ESS), which depends upon the numbers we select as estimators of $\alpha$ and $\beta$. (If you choose a bad $\hat{\alpha}$ and $\hat{\beta}$, you’ll have bigger residuals.) In symbols:

$$ESS(\hat{\alpha}, \hat{\beta}) = \sum_n [y_n - (\hat{\alpha} + \hat{\beta} x_n)]^2. \qquad 5.10$$

Now just choose an $\hat{\alpha}$ and $\hat{\beta}$ that will minimize this sum:

$$\min_{\hat{\alpha},\, \hat{\beta}} \sum_n [y_n - (\hat{\alpha} + \hat{\beta} x_n)]^2. \qquad 5.11$$

How would we minimize this? By taking first derivatives with respect to $\hat{\alpha}$ and $\hat{\beta}$, setting these derivatives equal to zero, and solving for the optimal estimators $\hat{\alpha}$ and $\hat{\beta}$. (The derivation is rather tedious, 9 pages in the textbook I like best, but only requires the first and seventh of our eight assumptions. The other assumptions are required to assure some desirable properties for these OLS estimators.) The first derivatives yield so-called “normal equations,” which can be solved for $\hat{\alpha}$ and $\hat{\beta}$ to yield:

$$\hat{\beta} = \frac{s_{x,y}}{s_x^2}, \qquad 5.12$$

$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}. \qquad 5.13$$

In prose:

The OLS estimator of the slope equals the sample covariance between x and y, divided by the sample variance of x. (Isn’t that great?! The simple covariance between x and y had serious drawbacks in studying the relationship between two variables, but if we just divide it by the variance of x and allow for an intercept, many of those drawbacks disappear! And we’ll see that regression analysis opens up many other avenues of learning that are missed by correlation and covariance. Better Living Through Calculus!) The OLS estimator of the intercept is derived from the estimate of the slope—find the slope first, then find the only intercept that allows a line with that slope to pass through both the average value of y and the average value of x.
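The recipe in Equations 5.12 and 5.13 (covariance over variance for the slope, then back out the intercept from the two sample means) is short enough to sketch in plain Python. The function name `ols_fit` and the five data points are made up for illustration; in practice your statistical package does this for you.

```python
def ols_fit(x, y):
    """Return (alpha_hat, beta_hat) following Equations 5.12 and 5.13."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sample covariance of x and y, and sample variance of x.
    # Both divide by N - 1, so the divisor cancels in the ratio;
    # it is kept here only to mirror the prose definition.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_xx = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    beta_hat = s_xy / s_xx                # slope: covariance / variance
    alpha_hat = y_bar - beta_hat * x_bar  # line passes through (x_bar, y_bar)
    return alpha_hat, beta_hat

# Hypothetical data, chosen so the arithmetic is easy to check by hand.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
alpha_hat, beta_hat = ols_fit(x, y)
print(f"intercept: {alpha_hat:.3f}, slope: {beta_hat:.3f}")
```

For these numbers the sample covariance is 1.5 and the sample variance of x is 2.5, so the slope works out to 0.6 and the intercept to 4 - 0.6(3) = 2.2.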

Long, long ago (1978), in a galaxy far away, young college students like your instructor derived and calculated such estimators by hand. Using slide-rules to square and sum deviations and cross-products, re-defining variable units to ease the calculations, keypunching programs on eighty-space IBM cards (one card per line of program), standing in long lines for printed output from screen-less mainframe computers... Those were the days. Now STATA will do this for enormous multi-dimensional data sets, and have the results on your screen before you can glance up. For that reason, we don’t do many tedious problems involving the normal equations any more. While this is generally a good thing, it sometimes leads to The Bubba Effect. It’s so easy to do regressions that people become thoughtless, committing what you might call Type III error: the use of a good model in the wrong situation. When analysis was difficult, there were fewer tools at our disposal, but people didn’t use them casually.

The silver lining is that, freed from some tedium, we can spend more class time developing your clinical instincts about the actual practice of applied econometrics. That should help minimize the probability of Type III error. This is why you are learning to write original programs in a statistical language rather than just pointing-and-clicking, why you do a major research project for this class, why we sometimes digress into the philosophy of science surrounding econometrics, and why we spend a good deal of group time studying cases of good and bad econometrics. These are all ways of making you a more thoughtful econometrician.

By the way, machines are human too, and when squaring and summing lots of big numbers they tend to make rounding errors. It’s therefore a good idea to measure your variables in large units where possible, to keep your numbers small. For example, if GDP is your independent variable, measure it in trillions of dollars rather than in dollars, so that the computer will be working with smaller squared numbers. Just be sure to remember that the resulting slope coefficient measures the marginal effect of a one-trillion-dollar change in GDP, not a one-dollar change.
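Here is a small sketch of what rescaling does to the slope. The GDP figures and the dependent variable are invented for illustration, and `ols_slope` is just a hypothetical helper that applies the covariance-over-variance formula. Dividing x by one trillion multiplies the slope by one trillion and leaves the fit itself unchanged.

```python
import math

def ols_slope(x, y):
    """OLS slope: sum of cross-products over sum of squared x deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx

# Hypothetical data: GDP measured in dollars, and some dependent variable y.
gdp_dollars = [18.2e12, 18.7e12, 19.5e12, 20.5e12, 21.4e12]
y = [2.1, 2.4, 2.3, 2.9, 3.0]

# Same data, with GDP re-expressed in trillions of dollars.
gdp_trillions = [g / 1e12 for g in gdp_dollars]

slope_per_dollar = ols_slope(gdp_dollars, y)
slope_per_trillion = ols_slope(gdp_trillions, y)

print(f"effect of a one-dollar change:          {slope_per_dollar:.3e}")
print(f"effect of a one-trillion-dollar change: {slope_per_trillion:.3f}")
print(math.isclose(slope_per_trillion, slope_per_dollar * 1e12, rel_tol=1e-9))
```

The second number is far easier to read and to report, and the intermediate sums the machine has to carry are about 24 orders of magnitude smaller.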

Having discovered how to compute OLS estimators of the regression parameters, let’s move on to the properties of these estimators.

5. Properties of the OLS Estimators

Chapter 4, Section 3 outlined some desirable small- and large-sample properties for estimators. We present, without proof, a scorecard of some of the OLS estimators’ attributes:

Unbiased and Consistent: 5.14

Assumptions 2 and 3 assure that $\hat{\alpha}$ and $\hat{\beta}$ are unbiased. Assumptions 3 and 7 assure that they are consistent (as long as the sample variance of the independent variable is not infinitely large, which is unlikely).

Efficiency: BLUE estimators 5.15

Given Assumptions 2, 3, 4, 5 and 7, the OLS estimators are the Best (most efficient) Linear Unbiased Estimators (they are “BLUE”). Among the many linear combinations of the data that form unbiased estimators, the OLS estimators are the most efficient. The proof of this property is known as the Gauss-Markov Theorem.

You should be aware that, when some of the assumptions do not hold, the OLS estimators become inferior to other options. The most common “other option” would involve so-called “maximum likelihood” estimation. With maximum likelihood estimation, we completely scrap the idea of minimizing squared deviations between data and estimated line. Instead we specify the probability distribution (or “likelihood function”) of the error term (we’ve assumed it’s a normal distribution, but it could be something else). This likelihood function will depend in part on the values of $\hat{\alpha}$ and $\hat{\beta}$. We then choose $\hat{\alpha}$ and $\hat{\beta}$ that will maximize the likelihood that we would have observed the data that were collected. You will be relieved to know that…

MLE estimators: 5.16

If Assumptions 2 through 8 are met, the OLS estimators are identical to the maximum likelihood estimators.

There are three more topics we should consider about the simple linear regression model, and they all concern practical matters of measuring the precision of the OLS estimates. For example, it’s cold comfort to know the OLS estimators are the most efficient available if their variances are still very large. The three sections that round out this chapter discuss the precision of the estimators, hypothesis testing, and forecasting.

6. Estimator Variances, Covariances, and Goodness of Fit

The variance and standard error of an estimator are indexes of its reliability and precision. Estimators like $\hat{\alpha}$ and $\hat{\beta}$ are random variables, of course, because they are linear combinations of the observed $y_n$ and therefore of the error terms $\varepsilon_n$, which are normally-distributed random variables. Thus the variances of $\hat{\alpha}$ and $\hat{\beta}$ might be expected to depend upon the variance of the random error term (which we’ll simply call $\sigma^2$ rather than $\sigma_\varepsilon^2$ whenever possible). In particular, it can be shown that

$$\mathrm{Var}(\hat{\beta}) = \frac{\sigma^2}{\sum_n (x_n - \bar{x})^2} \qquad 5.17$$

and

$$\mathrm{Var}(\hat{\alpha}) = \frac{\sum_n x_n^2 \; \sigma^2}{N \sum_n (x_n - \bar{x})^2}. \qquad 5.18$$

In words:

The estimator of the regression slope, $\hat{\beta}$, becomes more precise (has smaller variance) as
1) we sample a wider variety of x values (which makes the denominator grow),
or 2) the error term’s variance (in the numerator) is smaller, packing our data more tightly around the regression line.

The estimator of the regression intercept, $\hat{\alpha}$, becomes more precise as
1) we sample a wider variety of x values,
2) the error term’s variance shrinks,
3) our observations are nearer the vertical axis, where the intercept is (which shrinks the first term in the numerator),
or 4) our sample size (N) increases. (Of course, a larger sample size will indirectly make the $\hat{\beta}$ estimator more reliable, too, by increasing the sum of squared deviations of $x$. But there’s an additional direct effect of sample size upon the precision of $\hat{\alpha}$.)

Our estimates $\hat{\alpha}$ and $\hat{\beta}$ are not independent of each other, as they’re computed from the same data sample. Thus they normally have a covariance. It can be shown that

$$\mathrm{Cov}(\hat{\alpha}, \hat{\beta}) = \frac{-\bar{x}\,\sigma^2}{\sum_n (x_n - \bar{x})^2}. \qquad 5.19$$

In words: $\hat{\alpha}$ and $\hat{\beta}$ become less interdependent as
1) we sample a wider variety of x values (swelling the denominator),
2) the error term’s variance shrinks,
or 3) our observations are nearer the vertical axis, shrinking $\bar{x}$.
4) The covariance between $\hat{\alpha}$ and $\hat{\beta}$ has a sign opposite to the average value of x (since the expression begins with a negative sign, and everything in the expression except $\bar{x}$ is always positive). The covariance is negative, as long as the independent variable is on-average positive.

That all makes sense if you picture a line being fit to a cluster of data. If you were to push down on the intercept while trying to make the line fit well, the slope would have to increase—a negative covariance between slope and intercept (Item 4). And this bobbing of the regression line would be less pronounced if the data are clustered near the vertical axis (Item 3) and tightly packed together (Items 1 and 2).
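If you’d like to see Equations 5.17 through 5.19 at work, the following sketch evaluates them for two hypothetical sets of x values with the same spread: one centered at x = 3 and one centered on the vertical axis. The error variance of 4 is simply assumed (in real work it is unknown), and the function name `estimator_moments` is invented for this illustration.

```python
def estimator_moments(x, sigma2):
    """Var(alpha_hat), Var(beta_hat), Cov(alpha_hat, beta_hat), per 5.17-5.19."""
    n = len(x)
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)  # sum of squared deviations of x
    sum_x2 = sum(xi ** 2 for xi in x)          # raw sum of squared x values
    var_beta = sigma2 / s_xx                   # Equation 5.17
    var_alpha = sigma2 * sum_x2 / (n * s_xx)   # Equation 5.18
    cov_ab = -x_bar * sigma2 / s_xx            # Equation 5.19
    return var_alpha, var_beta, cov_ab

sigma2 = 4.0                          # assumed (hypothetical) error variance
x_far = [1.0, 2.0, 3.0, 4.0, 5.0]     # centered at x = 3
x_near = [xi - 3.0 for xi in x_far]   # same spread, centered on the vertical axis

far = estimator_moments(x_far, sigma2)
near = estimator_moments(x_near, sigma2)
print("centered at 3:   ", far)
print("centered at zero:", near)
```

Moving the data onto the vertical axis leaves the slope’s variance alone (0.4 in both cases), cuts the intercept’s variance from 4.4 to 0.8, and takes the covariance from -1.2 to zero, which is Items 3 and 4 in action.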

All three of these expressions (the variance of $\hat{\alpha}$, the variance of $\hat{\beta}$, and the covariance between $\hat{\alpha}$ and $\hat{\beta}$) involve $\sigma^2$. Unfortunately, that’s a population variance of the error term, which is usually invisible, making it hard to actually calculate the estimators’ variances and covariances. But we can estimate this population error’s variance, using the sample variance of our sample error, $\hat{\varepsilon}$. An unbiased estimator is the sample variance of our error terms:

$$s_\varepsilon^2 = \frac{\sum_n \hat{\varepsilon}_n^2}{N - 2} = \frac{\sum_n [y_n - (\hat{\alpha} + \hat{\beta} x_n)]^2}{N - 2}, \qquad 5.20$$

which is also sometimes called $\hat{\sigma}^2$. (You might have expected that we’d divide by N-1, as we did when calculating the variance of a simple random variable. But in that case we were dividing by the number of degrees of freedom, which was N-1 because we had calculated one parameter estimate already, the sample mean $\bar{x}$, leaving us with N-1 free bits of information. In this case, before we could calculate the $\hat{\varepsilon}_n$ terms we had to calculate two parameter estimates, $\hat{\alpha}$ and $\hat{\beta}$, so we only have N-2 bits of free information (degrees of freedom) left.)

The square root of this estimated error-term variance is called the “standard error of the residuals” or “standard error of the regression,” noted $\hat{\sigma}$ or $s_\varepsilon$. As you might expect, if we swap this $\hat{\sigma}^2$ for the $\sigma^2$ term in Equations 5.17-19, we achieve estimates of the variances and covariances of the regression coefficients $\hat{\alpha}$ and $\hat{\beta}$.

If we then take the square root of the estimated variances, we have the standard errors of the regression coefficient estimates:

$$s_{\hat{\beta}} = \sqrt{\frac{\hat{\sigma}^2}{\sum_n (x_n - \bar{x})^2}} \qquad 5.21$$

$$s_{\hat{\alpha}} = \sqrt{\frac{\sum_n x_n^2 \; \hat{\sigma}^2}{N \sum_n (x_n - \bar{x})^2}} \qquad 5.22$$

To summarize this sixth section of this chapter: We’ve learned how to estimate the standard error of the regression, $s_\varepsilon$ (usually simply noted as $s$) and the standard errors of the parameter estimators, $s_{\hat{\alpha}}$ and $s_{\hat{\beta}}$.

These are very useful because once you know something’s standard error you can do hypothesis tests and construct confidence intervals. Of course, you’ll probably never use Equations 5.20-22 to calculate these standard errors, because they are routinely calculated by statistical software. But now you know where these numbers come from.
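For the curious, here is roughly what the software is doing behind the scenes. This is a sketch under the chapter’s assumptions, using five made-up data points and an invented function name (`ols_with_standard_errors`), not a reproduction of any particular package’s code. It computes the OLS estimates, the estimated error variance from Equation 5.20 (note the N-2 divisor), and the two standard errors from Equations 5.21 and 5.22.

```python
import math

def ols_with_standard_errors(x, y):
    """OLS estimates plus the standard errors from Equations 5.20-5.22."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar

    # Residuals: vertical distances between the data and the fitted line.
    residuals = [yi - (alpha_hat + beta_hat * xi) for xi, yi in zip(x, y)]

    # Equation 5.20: divide by N - 2 because two parameters were estimated.
    sigma2_hat = sum(e ** 2 for e in residuals) / (n - 2)

    # Equations 5.21 and 5.22.
    se_beta = math.sqrt(sigma2_hat / s_xx)
    se_alpha = math.sqrt(sigma2_hat * sum(xi ** 2 for xi in x) / (n * s_xx))

    return {
        "alpha_hat": alpha_hat,
        "beta_hat": beta_hat,
        "sigma2_hat": sigma2_hat,
        "se_alpha": se_alpha,
        "se_beta": se_beta,
    }

# Hypothetical data, small enough to verify by hand.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
results = ols_with_standard_errors(x, y)
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

Here the squared residuals sum to 2.4, so the estimated error variance is 2.4/3 = 0.8, and the standard errors are the square roots of 0.08 (slope) and 0.88 (intercept).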

One more loose end to tie off: Because $s_{\hat{\alpha}}$ and $s_{\hat{\beta}}$ give us a measure of the precision of our estimates of $\alpha$ and $\beta$, you might suppose that $s_\varepsilon$ is giving us a measure of the precision of the whole regression line, all at once, by measuring how tightly the data are packed around our estimated line. That’s the right instinct, but it must be refined because $s_\varepsilon$, as an estimated standard error, is sensitive to the units in which our variables are measured. But we can construct a closely-related statistic that’s a more reliable measure of the goodness-of-fit between our data and our regression line:

Why are we doing a regression? In order to explain some of the changes in Y by relating them to changes in X. The maximum amount of squared deviations in Y that we could possibly explain would be… all of them! Call that number the Total Sum of Squares, TSS:

$$TSS = \sum_n (y_n - \bar{y})^2 \qquad 5.23$$


How well has our regression done at anticipating and explaining these variations in Y? Why not measure this by adding up our failures to explain, the squared distances of our data from our regression line--the squared errors from our regression analysis. Call that number the Error Sum of Squares:

$$ESS = \sum_n [y_n - (\hat{\alpha} + \hat{\beta} x_n)]^2 \qquad 5.24$$

The difference between TSS and ESS will be equal to the sum of squared deviations in y that our regression does correctly anticipate. These are squared distances between the sample mean of y and the points on the estimated regression line that lie directly above or below our data. Call this the Regression Sum of Squares:

$$RSS = \sum_n (\hat{y}_n - \bar{y})^2 = \sum_n [(\hat{\alpha} + \hat{\beta} x_n) - \bar{y}]^2 \qquad 5.25$$

Look at the definitions of TSS and ESS. The first adds up deviations around our sample mean of y; the second adds up deviations around the numbers that our regression predicts to be the mean of y, given the level of x we’ve observed. The first measures total variation in y, the second measures total unexplained variation in y. So you could measure the percentage of the variation in y that our regression fails to explain with the ratio $\frac{ESS}{TSS}$.

We can restate this by calculating the percentage of variation in y that our regression does explain, the coefficient of determination, noted $R^2$:

$$R^2 = \frac{RSS}{TSS} = 1 - \frac{ESS}{TSS}, \qquad 0 \le R^2 \le 1 \qquad 5.26$$

$R^2$ is a proportion, so it is unaffected by changes in the unit of measurement of our variables; it’s a unit-free measure of the goodness of fit between our data and our regression line, because the numerator and denominator are measured in the same units. $R^2$ lies between 0 and 1 because we can’t explain more than 100% of the variation in y, nor less than 0%.

By the way, if you’re straining to see how this $R^2$ is related to $s^2$, which we set out to improve upon as a measure of the regression’s goodness of fit, I can help:

$$R^2 = 1 - \frac{(N-2)\, s^2}{(N-1)\, s_y^2}.$$

If you’re stymied by the name $R^2$, it’s due to the fact that this statistic is equal to the square of the sample correlation (abbreviated r) between the observed values of $y_n$ and our regression’s predicted values for y (which are $\hat{\alpha} + \hat{\beta} x_n$, or $\hat{y}_n$). Though $R^2$ is widely used as a measure of goodness of fit, we’ll see that it has some limitations. Two of them would become clear after staring at our derivation: You can not use $R^2$ to compare the goodness of fit between two regressions if 1) one regression contains an intercept and the other does not, nor if 2) the dependent variables of the two regressions are not the same (for example, if one uses y and the other uses ln(y)).
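The bookkeeping behind $R^2$ is easy to check numerically. The sketch below uses five made-up data points and an invented function name (`goodness_of_fit`); it computes TSS, ESS and RSS, forms $R^2$, and also computes the squared correlation between the observed and predicted values of y so the two can be compared.

```python
import math

def goodness_of_fit(x, y):
    """Return TSS, ESS, RSS, R-squared, and the squared corr(y, y_hat)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    y_hat = [alpha_hat + beta_hat * xi for xi in x]  # predicted values

    tss = sum((yi - y_bar) ** 2 for yi in y)               # Equation 5.23
    ess = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # Equation 5.24
    rss = sum((fi - y_bar) ** 2 for fi in y_hat)           # Equation 5.25
    r2 = 1.0 - ess / tss                                   # Equation 5.26

    # Sample correlation between observed and predicted y, then squared.
    # (With an intercept in the model, the mean of y_hat equals y_bar.)
    cross = sum((yi - y_bar) * (fi - y_bar) for yi, fi in zip(y, y_hat))
    r = cross / math.sqrt(tss * rss)
    return tss, ess, rss, r2, r ** 2

# Hypothetical data, small enough to verify by hand.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
tss, ess, rss, r2, r_squared = goodness_of_fit(x, y)
print(f"TSS={tss:.2f}  ESS={ess:.2f}  RSS={rss:.2f}")
print(f"R^2={r2:.3f}  squared correlation={r_squared:.3f}")
```

For these numbers TSS = 6, ESS = 2.4 and RSS = 3.6, so the regression explains 60% of the variation in y, and the squared correlation agrees.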

7. Hypothesis Testing

We’ve discussed four regression statistics that you might want to hypothesize about: $\hat{\alpha}$, $\hat{\beta}$, $\hat{\sigma}^2$ and $R^2$. We’ll defer tests concerning $R^2$ until the next chapter, when we can be more robust about it. The most common test asks whether $\beta$ is zero. That would indicate that there’s no relationship between X and Y, and exploring this relationship was presumably our reason for doing the regression. For completeness I’ll summarize the typical tests for all three regression parameters, then present one new twist on hypothesis testing.


$\hat{\alpha}$ and $\hat{\beta}$ are normally distributed (since they’re derived from regression errors, which are presumed to be normally distributed).

$\hat{\alpha}$ and $\hat{\beta}$ are distributed independently of $\hat{\sigma}^2$.

$$\frac{\sum_n \hat{\varepsilon}_n^2}{\sigma^2} \sim \chi^2_{N-2}$$

Test for $\hat{\beta}$: 5.27

$H_0: \beta = \beta_0$ (where $\beta_0$ is some number you supply, frequently zero)

$H_A: \beta \neq \beta_0$ (two-tailed test), or $H_A: \beta > \beta_0$ or $\beta < \beta_0$ (one-tailed tests)

Test Statistic under $H_0$: $t = \dfrac{\hat{\beta} - \beta_0}{s_{\hat{\beta}}} \sim t_{N-2}$

Decision Rule: Reject $H_0$ if $|t| > t_c$ (two-tailed test) or $t > t_c$ or $t < -t_c$ (one-tailed tests), where $t_c$ is a critical value determined by the level of significance.

In words: To test whether the regression slope is equal to a particular number ($\beta_0$), find how many standard deviations your estimate lies from that number. The greater this distance between estimate and $\beta_0$, the less believable $\beta_0$ becomes.

Test for $\hat{\alpha}$: 5.28

Identical. Replace each $\beta$ in 5.27 with an $\alpha$, and you have it.
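As a rough illustration of the slope test in 5.27, the sketch below computes the t statistic for five made-up data points and compares it with a critical value. The critical value 3.182 is the standard two-tailed 5% value for 3 degrees of freedom, copied from a t table and hard-coded here as an assumption; the function name `slope_t_statistic` is likewise invented.

```python
import math

def slope_t_statistic(x, y, beta_0=0.0):
    """Return (t, degrees of freedom) for H0: beta = beta_0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    ess = sum((yi - (alpha_hat + beta_hat * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = ess / (n - 2)              # Equation 5.20
    se_beta = math.sqrt(sigma2_hat / s_xx)  # Equation 5.21
    t = (beta_hat - beta_0) / se_beta       # distance from beta_0 in standard errors
    return t, n - 2

# Hypothetical data, small enough to verify by hand.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
t_stat, dof = slope_t_statistic(x, y)

# Assumed input: two-tailed 5% critical value for 3 degrees of freedom,
# looked up in a t table rather than computed.
t_crit = 3.182
reject = abs(t_stat) > t_crit
print(f"t = {t_stat:.3f} with {dof} degrees of freedom; reject H0: {reject}")
```

With only five observations the slope estimate of 0.6 sits about 2.12 standard errors from zero, which is not far enough to reject the null at the 5% level. Swapping in the intercept estimate and its standard error gives the test in 5.28.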

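Test for $\sigma^2$: 5.29

$H_0: \sigma^2 = \sigma_0^2$

$H_A: \sigma^2 > \sigma_0^2$

Test Statistic under $H_0$: $S = \dfrac{(N-2)\,\hat{\sigma}^2}{\sigma_0^2} \sim \chi^2_{N-2}$

Decision Rule: Reject $H_0$ if $S > \chi^2_c$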

And now the new twist: Recall that there’s really no magic level of significance that’s universally appropriate. A hypothesis test that simply chooses a level of significance and reports a success or failure leaves us vaguely unsatisfied: If the hypothesis failed, by how much? If it didn’t satisfy your tolerance for significance, it might have satisfied mine. For that reason it’s become common to report a “p-value” for each estimated coefficient, where the p-value equals the level of significance at which your null hypothesis would have just barely passed the hypothesis test.

Quick Example: Say that you’ve done a regression of quantity demanded (measured in bushels of cucumbers) on price, with the following results:

(Standard errors of estimates lie below the parameter estimates)

$$\hat{Q} = \underset{(52.3)}{400.0} - \underset{(2.5)}{5.0}\,P, \qquad R^2 = .78, \quad N = 22, \quad s = 8.37$$


$\hat{\beta}$, your slope estimate in a regression, is equal to 5.0, and the estimated standard error of

$$\hat{Q} = 400.0 - 5.0\,P, \qquad R^2 = .78, \quad N = 22, \quad s = 37.8,$$

because you wonder how many bushels of cucumbers will be bought at a price of \$4 per bushel. Since $\hat{\alpha}$ and $\hat{\beta}$ are unbiased estimators of $\alpha$ and $\beta$, you’d get an unbiased estimate of $Q \mid P = 4.0$ (“quantity, given that price equals 4.0”) by just setting price equal to 4.0 in your regression equation and solving for the forecast level of Q. This yields a “point estimate” equal to $400.0 - 5.0(4.0) = 380.0$.

You’d naturally like some indication of the reliability of this point estimate. If we knew the standard error of this predictor, we could calculate a confidence interval. Let’s say you’d like a 95% confidence interval for the quantity demanded at a price of 4.0. At this point we must be rather precise, because there are two different, closely-related confidence intervals that might interest us:

We could construct an interval that, with 95% certainty, captures the point on the regression line that lies above a price of 4.0.

We could construct an interval that captures 95% of the demand conditions we’ll actually experience at a price of 4.0.

The first option gives us a range within which the average of our sales is likely to fall, a “confidence interval for the mean predictor.”

We’re 95% sure that, at a price equal to 4.0, the population regression line lies between these two points. If we constructed such a range for each possible level of price, we’d have constructed a space within which we’re 95% sure that we’ve captured the true population regression line:


(The interval gets wider as we move farther from our average observation of X, because we’re forecasting farther from the core of the information we’ve gathered.)

The second option gives us a wider range within which the actual level of our sales is likely to fall, a “confidence interval for point forecasts.” This has to be a wider range, since actual sales vary around their average levels:

The standard error of the mean estimator (the first option) can be estimated by

$$s_{\hat{y}_0} = \sqrt{s^2\left[\frac{1}{N} + \frac{(x_0 - \bar{x})^2}{\sum_n (x_n - \bar{x})^2}\right]}.$$

I will spare you the proof. In words, we can forecast the mean value of y at any particular value $x_0$ of the independent variable; the standard error of that forecast is larger

1) if the estimated variance $s^2$ of our error term is larger (that is, if the data are widely dispersed around our regression line);

2) if our sample size, N, is small;

3) if we are trying to forecast far away from the average observation of x; and

4) if our observations of x are not very well dispersed.

Since we can compute the standard error of our forecasted mean, we can employ the usual logic to construct a confidence interval for the mean forecast:

$$\hat{y}_0 \pm t_c\, s_{\hat{y}_0},$$

where $t_c$ is a critical t-statistic value that depends on our level of confidence. In words, we are c% sure that the forecasted mean lies within $t_c$ standard deviations of $\hat{y}_0$.


For point forecasts (the second option), the relevant standard error is

$$s_{\hat{y}_0} = \sqrt{s^2\left[\frac{1}{N} + \frac{(x_0 - \bar{x})^2}{\sum_n (x_n - \bar{x})^2} + 1\right]}.$$

Compared to the previous standard error, there’s one small difference: We’ve added in an extra $s^2$, because individual observations vary around their expected value, with a variance we’ve estimated to be $s^2$. By using this $s_{\hat{y}_0}$ you can compute confidence intervals in the normal way, so that your confidence interval for a point forecast would be

$$\hat{y}_0 \pm t_c\, s_{\hat{y}_0}.$$
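To see the two intervals side by side, here is a rough Python sketch. It fits a one-variable regression by least squares and then builds both the interval for the mean and the wider interval for a point forecast. The data are invented for illustration, the function name `forecast_intervals` is hypothetical, and the critical value $t_c$ is assumed to come from a t table with N - 2 degrees of freedom.

```python
import math

def forecast_intervals(x, y, x0, t_crit):
    """Fit y = a + b*x by least squares; return the forecast at x0 with both intervals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    # s^2: sum of squared residuals over N - 2 degrees of freedom
    s2 = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    y0 = a + b * x0
    core = 1 / n + (x0 - x_bar) ** 2 / sxx
    se_mean = math.sqrt(s2 * core)         # standard error for the mean of y at x0
    se_point = math.sqrt(s2 * (core + 1))  # extra s^2 for an individual observation
    return {
        "forecast": y0,
        "mean_interval": (y0 - t_crit * se_mean, y0 + t_crit * se_mean),
        "point_interval": (y0 - t_crit * se_point, y0 + t_crit * se_point),
    }

# Hypothetical price and quantity observations, invented for this example.
prices = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
quantities = [392.0, 383.0, 381.0, 374.0, 372.0, 364.0]
# 2.776 is the tabulated two-tailed 5% critical t value for N - 2 = 4 degrees of freedom.
result = forecast_intervals(prices, quantities, x0=4.0, t_crit=2.776)
print(result)
```

Both intervals are centered on the same forecast; the point-forecast interval is always the wider one, and both widen as `x0` moves away from the average price in the sample.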

9. Comparing Forecasts

Imagine that two farmers from our market have developed slightly different regressions to forecast their sales. They might want a way to compare the approaches, to decide which makes more reliable forecasts. Call the forecasted value of the dependent variable $y_n^f$, and the actual value that eventually occurred $y_n$. Here are three typical scorecards you might compute after making several forecasts:

(For all three, a low score is better than a high score.)

Mean Squared Error (MSE):

$$\mathrm{MSE} = \frac{\sum_n (y_n^f - y_n)^2}{N - 2}$$

Looks like a variance, doesn’t it? Some prefer its square root, “root mean squared error.”

Mean Absolute Percent Error (MAPE), which only works if all y are positive:

$$\mathrm{MAPE} = \left[\frac{1}{N}\sum_n \frac{|y_n^f - y_n|}{y_n}\right] \cdot 100$$

Mean Squared Percentage Error (MSPE):

$$\mathrm{MSPE} = \left[\frac{1}{N}\sum_n \left(\frac{y_n^f - y_n}{y_n}\right)^2\right] \cdot 100$$

Again, some prefer its square root, “root mean squared percentage error.”
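The three scorecards are easy to compute by hand or by machine. Here is a small Python sketch with invented forecasts and outcomes. The function names are hypothetical, and the MSE divisor is left as a parameter (`dof_lost`) because conventions differ on whether to divide by N or by N minus the number of estimated coefficients.

```python
def mse(forecasts, actuals, dof_lost=0):
    """Mean squared error; divides by N - dof_lost."""
    squared = [(f - a) ** 2 for f, a in zip(forecasts, actuals)]
    return sum(squared) / (len(squared) - dof_lost)

def mape(forecasts, actuals):
    """Mean absolute percent error; assumes every actual value is positive."""
    shares = [abs(f - a) / a for f, a in zip(forecasts, actuals)]
    return 100 * sum(shares) / len(shares)

def mspe(forecasts, actuals):
    """Mean squared percentage error; assumes every actual value is positive."""
    shares = [((f - a) / a) ** 2 for f, a in zip(forecasts, actuals)]
    return 100 * sum(shares) / len(shares)

# Invented example: three forecasts and the sales that actually occurred.
forecasts = [110.0, 190.0, 330.0]
actuals = [100.0, 200.0, 300.0]
print(mse(forecasts, actuals), mape(forecasts, actuals), mspe(forecasts, actuals))
```

Notice how the single 30-unit miss dominates the MSE, while MAPE treats it as no worse than the 10-unit miss on the smallest sale. That is the practical difference between penalizing amounts and penalizing percentages.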

Each approach has its champions and detractors, and your choice of a scorecard for forecasts probably should depend upon the situation. In some cases the amount of error is more important than the percentage error, in others not; in some cases you want to severely penalize large errors by squaring them, but sometimes this would be inappropriate.

If you’re impatient and want to evaluate several forecasters without waiting for additional observations, all is not lost. You can estimate your regressions using only a percentage of the observations you have (say, 90%), use the regressions to make forecasts of y at the x values in your unused observations, then compute the MSE or MAPE or MSPE for these so-called “post-sample forecasts.”
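A post-sample comparison along these lines might be sketched as follows in Python. The sketch assumes the observations are in an order where holding out the last 10% makes sense (time order, say), uses MSE divided by the number of held-out observations as the scorecard, and works with invented data; `fit_line` and `post_sample_mse` are hypothetical names.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b * x_bar, b

def post_sample_mse(x, y, train_share=0.9):
    """Estimate on the first train_share of the data, score forecasts on the rest."""
    cut = int(round(len(x) * train_share))
    if cut < 2 or cut >= len(x):
        raise ValueError("need at least 2 estimation points and 1 held-out point")
    a, b = fit_line(x[:cut], y[:cut])
    # Forecast the held-out y values from their x values and square the misses.
    errors = [(a + b * xi - yi) ** 2 for xi, yi in zip(x[cut:], y[cut:])]
    return sum(errors) / len(errors)

# Invented data: ten observations, the last one held out for scoring.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [5.2, 6.9, 9.1, 11.0, 12.8, 15.1, 17.0, 18.9, 21.2, 22.9]
print(post_sample_mse(x, y))
```

To compare two forecasters, run each one's regression through the same split and see whose post-sample score is lower.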

10. More Regressions to Come

Simple linear regression is powerful, but not powerful enough for a complicated world in which we can’t run controlled experiments. In the next chapter we consider expanding our model to cases with more than one explanatory variable. In the succeeding chapters we will relax most of our eight simplifying assumptions.

Useful STATA Commands:

Values you supply are in italics. Words you type literally are in boldface. Options are in [ ], which you don’t type.

Entering data:

*From the keyboard, within STATA:

clear /* to clear any existing data in memory */

input names /* names: 1-8 characters, period for missing values; STATA is case-sensitive */
/* enter observations one at a time, space between variables, starting next line */

end

/* later new observations: input ... end */
/* later new variables: input names */

*Outside STATA, you can either
* enter data in a text file (using something like WordPad), separating each variable by spaces and
* each observation by a carriage return, and save it as
* filename.asc, then use the Stata infile command:

infile variablelist using filename.extension

/* string variables: infile strxx varname where 0<xx<81 and xx=stringlength */
/* in file, strings go in “ “ marks if they contain blanks */

*or
* put your data into an Excel file, using the first row for variable names. Save it as a tab-delimited
* ASCII file, and use the Stata command

insheet using filename

* Stata will automatically read the variable names from your file.

Seeing data:

list
summarize
describe

Saving data:

save filename /* may not include blanks; saved as filename.dta */
/* use save filename, replace if overwriting an existing file */

Re-using saved Stata data:

use filename /* save or clear current data in memory first if necessary */

Using non-STATA (ASCII) data files that are not in the same directory as STATA:

*For example, if a data file is in a common directory in the lab, take this approach:
*a. Find the data file you want to use, in the common drive. Drag a copy of it onto your desktop if you wish.
*b. Open the STATA program, and issue the first part of the INFILE command:

infile variablenames using

*Now, rather than trying to type out the address of the file correctly, just go up to the FILE menu,
*and choose the FILENAME option. You'll get a dialog box. Navigate to the desktop, and click on
*the name of your data file. This name will automatically appear in the command line you were
*typing, enclosed in " " marks. Now you can just tap your ENTER key, and the command should
*bring the data into the program for you.


Altering Variables:

generate newvariable = expression [if expression] [in range]

replace oldvariable = expression [if expression] [in range]

*For creating or altering parameters, use

scalar scalarname = expression

/* you can follow this with scalar list scalarname or scalar drop */
*Expressions can include functions: abs(x), exp(x), ln(x), log10(x), sqrt(x), or statistical functions.
*For correlations, use
correlate [variablelist] [weight] [if expression] [in range] [, means covariance]
*For pairwise correlations only, you can use
pwcorr [variablelist] [weight] [if expression] [in range]

Regression:

regress depvariable indepvars [weight] [if expression] [in range] [, noconstant]
/* Saved results include:
e(N)     # observations
e(df_m)  model degrees of freedom
e(df_r)  residual degrees of freedom
e(r2)    R-squared
e(F)     regression F-statistic
e(rmse)  root mean square error
e(b)     coefficient vector
e(V)     var-cov matrix of estimators */
*See these with estimates list or, for the matrices, matrix list matrixname

predict newvariablename [if expression] [in range] [, statistic]
/* generates predicted values, where statistic can be
pr(a,b)    probability a<y<b
residuals  residuals
rstandard  standardized residuals
stdp       standard error of the prediction
stdf       standard error of forecast
stdr       standard error of the residual */

Saving your program and/or results:

log using filename.do [, noproc append replace]

/* saves your file as a “do” file, which can be edited and run again */
/* the noproc option saves only what you type, not any output */
/* append will append your work to the end of an existing file */
/* replace will overwrite an existing file */

log off /* to suspend logging */
log on /* to resume logging */
log close /* to quit logging for good */

To quit using STATA:

exit, clear /*but be sure you’ve saved your data first, if you need to */

Re-running your .do files:

*Just type