Regression - ST104a Vle

In this section we can only hope to review the fundamentals of what is a very large (and important) topic in statistical analysis; several textbooks and courses focus exclusively on regression analysis. Here we shall concentrate on the simple linear (bivariate) regression model. This will allow calculations to be performed on a hand calculator, unlike multiple regression, which typically requires the help of statistical computer packages due to the complexity and sheer number of calculations involved.

In the simple model, we have two variables Y and X:

Y is the dependent (or response) variable, that which we are trying to explain using:

X, the independent (or explanatory) variable, the factor we think influences Y.

Multiple regression is just a natural extension of this set-up, but with more than one explanatory variable.⁵

⁵ Multiple linear regression is covered in 04b Statistics 2.

There can be a number of reasons for wanting to establish a mathematical relationship between a response variable and explanatory variable(s), for example:

To find and interpret unknown parameters in a known relationship.

To understand the reason for such a relationship — is it causal?

To predict or forecast Y for specific values of the explanatory variable(s).

11.9.1 The simple linear regression model

We assume that there is a true (population) linear relationship between a response variable y and an explanatory variable x of the approximate form:

y = α + βx,

where α and β are fixed, but unknown, population parameters. Our objective is to estimate α and β using (paired) sample data (x_i, y_i), i = 1, . . . , n.

Note the use of the word ‘approximate’. Particularly in the social sciences, we would not expect a perfect linear relationship between the two variables. Hence we modify this basic model to

y = α + βx + ε,

where ε is some random perturbation from the initial ‘approximate’ line. In other words, each y observation almost lies on the line, but ‘jumps’ off the line according to the random variable ε. This perturbation is often referred to as the ‘residual’ or ‘error term’.

For each pair of observations (x_i, y_i), i = 1, . . . , n, we can write this as

y_i = α + βx_i + ε_i, i = 1, . . . , n.

The random deviations ε₁, ε₂, . . . , ε_n corresponding to the n data points are assumed to be independently normally distributed, with zero mean and constant (but unknown) variance σ². That is

ε_i ∼ N(0, σ²), i = 1, . . . , n.

This completes the model specification. To summarise, the assumptions of the simple linear regression model are:

A linear relationship between the variables of the form y = α + βx + ε.

The existence of three model parameters: the linear parameters α and β and the residual variance σ².

Var(ε_i) = σ² for all i = 1, . . . , n, that is, it does not depend on the explanatory variable.

The residuals are independent and N (0, σ²).
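As a minimal sketch of what these assumptions mean in practice, the following Python snippet generates data from the model: each y is a point on a straight line plus an independent normal error with zero mean and constant standard deviation. The function name and the parameter values (340, 3.2 and 40) are illustrative assumptions, not values taken from this guide.

```python
import random


def simulate_simple_regression(alpha, beta, sigma, xs, seed=0):
    """Draw one y per x from y_i = alpha + beta * x_i + e_i, with e_i ~ N(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    ys = []
    for x in xs:
        error = rng.gauss(0.0, sigma)  # independent, zero mean, same sigma for every x
        ys.append(alpha + beta * x + error)
    return ys


# Hypothetical parameter values, chosen only for illustration.
xs = [20, 25, 30, 40, 50]
ys = simulate_simple_regression(alpha=340.0, beta=3.2, sigma=40.0, xs=xs)
for x, y in zip(xs, ys):
    print(x, round(y, 1))
```

Setting sigma to zero places every simulated point exactly on the line, which is the ‘perfect’ linear relationship we do not expect to see in real data.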

You may feel that some of these assumptions are particularly strong and restrictive. For example, why should the residual variance be constant across all observations? Indeed, your scepticism serves you well. In a more comprehensive discussion of regression, such as in 20 Elements of econometrics, model assumptions would be properly tested to assess their validity. Given the limited scope of regression in 04a Statistics 1, sadly we are too time-constrained to consider such tests in detail. However, do be aware that with any form of modelling, a thorough critique of model assumptions is essential.⁶ Analysis based on false assumptions leads to invalid results, which is a bad thing.

⁶ The validity of certain model assumptions is considered in 04b Statistics 2.

11.9.2 Parameter estimation

As mentioned above, our principal objective is to estimate α and β, that is, the y-intercept and slope of the true line. To fit a line to some data as in this case, we need a criterion for establishing which straight line is in some sense ‘best’. The criterion used is to minimise the sum of the squared distances between the observed values of y_i and the values predicted by the model. (This ‘least squares’ estimation technique is explored in depth in 04b Statistics 2.) The estimated least squares regression line is written as:

ŷ = a + bx,

where a and b denote our estimates for α and β, respectively. The derivation of the formulae for a and b is not required,⁷ although you do need to remember the formulae, which are:

b = (Σ x_i y_i − n x̄ ȳ) / (Σ x_i² − n x̄²) and a = ȳ − b x̄.

⁷ The derivation is presented in 04b Statistics 2.

So note that when evaluating these parameter estimates, you need to compute b first, since this is needed to compute a.
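The order of calculation can be sketched in Python as follows. This assumes the standard least squares formulae written in terms of the sums Σx, Σy, Σx² and Σxy, namely b = (Σxy − n x̄ ȳ) / (Σx² − n x̄²) and a = ȳ − b x̄. The function name and the three-point dataset are hypothetical, chosen so the answer can be checked by eye.

```python
def least_squares_from_sums(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (a, b) for the fitted line y-hat = a + b * x."""
    x_bar = sum_x / n
    y_bar = sum_y / n
    # Slope first: b = (sum of xy - n * x_bar * y_bar) / (sum of x^2 - n * x_bar^2)
    b = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    # The intercept needs b: a = y_bar - b * x_bar
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data lying exactly on y = 1 + 2x: (1, 3), (2, 5), (3, 7).
a, b = least_squares_from_sums(n=3, sum_x=6, sum_y=15, sum_x2=14, sum_xy=34)
print(a, b)  # 1.0 2.0
```

Working from the sums mirrors a hand calculation: the five totals are accumulated once, and the estimates follow in two steps.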

11.9.3 Prediction

Having estimated the regression line, an important application of it is for prediction. That is, for a given value of the explanatory variable, say x₀, we can use it in the estimated regression line to obtain a prediction for y. This prediction, ŷ, is simply computed using

ŷ = a + bx₀,

remembering to attach the appropriate units to the prediction (i.e. the units of measurement of the original y data). Also, for that matter, ensure the value you use for x₀ is correct: if the original x data is in 000s, then a prediction of y when the explanatory variable is 10,000, say, would mean x₀ = 10, and not 10,000!
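The point about units can be illustrated with a short Python sketch. The fitted line used here (a = 2.0, b = 0.5) and both function names are hypothetical. The only assumption carried over from the text is that x was recorded in thousands, so a raw value must be rescaled before it is used as x₀.

```python
def predict(a, b, x0):
    """Prediction from the fitted line: y-hat = a + b * x0."""
    return a + b * x0


def to_thousands(raw_value):
    """Rescale a raw figure to the units of the x data (thousands)."""
    return raw_value / 1000


# Hypothetical fitted line, with x measured in thousands.
a, b = 2.0, 0.5

correct = predict(a, b, to_thousands(10_000))  # uses x0 = 10
wrong = predict(a, b, 10_000)                  # forgets to rescale x0
print(correct, wrong)  # 7.0 5002.0
```

The second figure shows how badly a prediction goes astray when x₀ is entered in the wrong units.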

Activity

A study was made by a retailer to determine the relationship between weekly advertising expenditure and sales (in thousands of pounds). Find the equation of a regression line to predict weekly sales from advertising. Estimate weekly sales when advertising costs are £35,000.

Adv. costs (in £000s)    40   20   25   20   30   50
Sales (in £000s)        385  400  395  365  475  440

Adv. costs (in £000s)    40   20   50   40   25   50
Sales (in £000s)        490  420  560  525  480  510

Summary statistics representing sales as y and advertising costs as x give:

Σx = 410, Σx² = 15650, Σy = 5445, Σy² = 2512925, Σxy = 191325.

So the parameter estimates are:

b = (191325 − 12 × (410/12) × (5445/12)) / (15650 − 12 × (410/12)²) = 3.221

a = 5445/12 − 3.221 × 410/12 = 343.7.

Hence the estimated regression line is ŷ = 343.7 + 3.221x,

and the estimated sales for £35,000 worth of advertising is (ŷ | x = 35) = 343.7 + 3.221 × 35 = 456.4, which is £456,400.

Note that since the advertising costs were given in £000s, we used x₀ = 35, and then converted the predicted sales into pounds.
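The activity can be checked with a few lines of Python. This is only a sketch of the hand calculation, using the twelve data pairs from the table; the variable names are illustrative.

```python
# Advertising costs (x) and sales (y), both in £000s, from the activity.
adv = [40, 20, 25, 20, 30, 50, 40, 20, 50, 40, 25, 50]
sales = [385, 400, 395, 365, 475, 440, 490, 420, 560, 525, 480, 510]

n = len(adv)
sum_x = sum(adv)
sum_y = sum(sales)
sum_x2 = sum(x * x for x in adv)
sum_xy = sum(x * y for x, y in zip(adv, sales))

x_bar = sum_x / n
y_bar = sum_y / n

b = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)  # slope first
a = y_bar - b * x_bar                                         # then intercept
y_hat = a + b * 35  # x0 = 35 because advertising is recorded in £000s

print(sum_x, sum_x2, sum_y, sum_xy)                  # 410 15650 5445 191325
print(round(b, 3), round(a, 1), round(y_hat, 1))     # 3.221 343.7 456.4
```

The printed sums agree with the summary statistics, and the rounded estimates match the values obtained by hand.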

11.9.4 Points to watch about linear regression

Non-linear relationships

Note first that you have only learned how to use a straight line for your best fit. So you could be missing quite important non-linear relationships, particularly if you were working in the natural sciences.

Which is the dependent variable?

Note also that it is essential to correctly establish which is the dependent (y) variable. In the above example, you would have a different line if you had taken advertising costs as y and sales as x!

So remember to exercise your common sense — we would expect sales to react to advertising campaigns rather than vice versa.
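To see that the choice of dependent variable matters, the sketch below fits both lines to the advertising data: sales on advertising, and advertising on sales. The helper fit_line is a hypothetical name, and it assumes the usual least squares formulae for the intercept and slope.

```python
def fit_line(xs, ys):
    """Least squares (intercept, slope) for the regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


adv = [40, 20, 25, 20, 30, 50, 40, 20, 50, 40, 25, 50]
sales = [385, 400, 395, 365, 475, 440, 490, 420, 560, 525, 480, 510]

a_sales, b_sales = fit_line(adv, sales)  # sales explained by advertising
a_adv, b_adv = fit_line(sales, adv)      # advertising explained by sales

# Rearranging the second line to give sales in terms of advertising
# implies a slope of 1 / b_adv, which is not the same as b_sales.
implied_slope = 1 / b_adv
print(round(b_sales, 3), round(implied_slope, 3))
```

The two slopes would coincide only if every point lay exactly on a straight line, so swapping the roles of x and y genuinely changes the fitted relationship.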

Extrapolation

In the previous example you used your estimated line of best fit to predict the value of y for a given value of x, i.e. advertising expenditure of £35,000. Such prediction is only acceptable if the value of x you use lies within the range of the x values in the dataset. If you use the estimated regression line to predict using x₀ which is outside of your available sample data, then you are performing extrapolation, for which any predictions should be viewed with extreme caution.

For the previous example, it may not be immediately obvious that the relationship between advertising expenditure and sales could change, but a moment’s thought should convince you that, were you to quadruple advertising expenditure from £35,000 to £140,000, sales would be unlikely to reach the 343.7 + 3.221 × 140 = 794.6, that is £794,600, implied by the fitted line. Basic economics would suggest diminishing marginal returns to advertising expenditure.
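One simple safeguard is to flag any prediction whose x₀ lies outside the observed x range. The sketch below does this for the advertising line, with x₀ = 140 standing for a quadrupling of the £35,000 used earlier. The function name is hypothetical, and the range 20 to 50 is read from the data table.

```python
def predict_with_range_check(a, b, x0, x_min, x_max):
    """Return (prediction, extrapolating) for the fitted line y-hat = a + b * x0."""
    extrapolating = not (x_min <= x0 <= x_max)
    return a + b * x0, extrapolating


# Fitted line and observed x range (in £000s) from the advertising activity.
a, b = 343.7, 3.221
x_min, x_max = 20, 50

for x0 in (35, 140):
    y_hat, extrapolating = predict_with_range_check(a, b, x0, x_min, x_max)
    note = "extrapolation, treat with caution" if extrapolating else "within the data range"
    print(x0, round(y_hat, 1), note)
```

The check does not make an extrapolated prediction any more reliable; it only makes it harder to overlook that the line is being used where there is no data to support it.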

Sometimes it is very easy to see that the relationship must change. For instance, consider the next example showing an anthropologist’s figures, collected on a Pacific island, on a mother’s years of education and the number of children she has.

Activity

Figures from our anthropologist show a negative relationship between the number of years of education, x, of the mother and the number of live births she has, y. The regression line is:

ŷ = 8 − 0.6x

based on figures of women with between 5 and 8 years of education who had 0 to 8 live births. This looks sensible. We predict 8 − 3 = 5 live births for those with 5 years of education and 8 − 6 = 2 live births for those with 10 years of education.

This is all very convincing, but say a woman on the island went to university and completed a doctorate and so had 15 years of education. She clearly cannot have minus 1 children! And, if someone missed school entirely, is she likely to have 8 children? We have no way of knowing. The relationship shown by our existing figures will probably not hold beyond the x data range.
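The breakdown of the fitted line can be tabulated directly. The following sketch evaluates ŷ = 8 − 0.6x at a few values of x and also finds where the line crosses zero; the function and variable names are illustrative.

```python
def predicted_births(years_of_education):
    """Fitted line from the activity: y-hat = 8 - 0.6 * x."""
    return 8 - 0.6 * years_of_education


# Years of education at which the fitted line reaches zero live births.
zero_crossing = 8 / 0.6

for years in (0, 5, 10, 15):
    y_hat = predicted_births(years)
    flag = "impossible (negative)" if y_hat < 0 else ""
    print(years, round(y_hat, 1), flag)

print(round(zero_crossing, 1))
```

Beyond roughly 13 years of education the line predicts a negative number of births, which shows plainly that the straight-line relationship cannot continue indefinitely.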
