Correlation, causation and forecasting

Multiple regression

5.7 Correlation, causation and forecasting

Correlation is not causation

It is important not to confuse correlation with causation, or causation with forecasting. A variable x may be useful for predicting a variable y, but that does not mean x is causing y. It is possible that x is causing y, but it may be that the relationship between them is more complicated than simple causality.

For example, it is possible to model the number of drownings at a beach resort each month with the number of ice-creams sold in the same period. The model can give reasonable forecasts, not because ice-creams cause drownings, but because people eat more ice-creams on hot days when they are also more likely to go swimming. So the two variables (ice-cream sales and drownings) are correlated, but one is not causing the other. It is important to understand that correlations are useful for forecasting, even when there is no causal relationship between the two variables.

However, often a better model is possible if a causal mechanism can be determined. In this example, both ice-cream sales and drownings will be affected by the temperature and by the numbers of people visiting the beach resort. Again, high temperatures do not actually cause people to drown, but they are more directly related to why people are swimming. So a better model for drownings will probably include temperatures and visitor numbers and exclude ice-cream sales.
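
The beach example can be sketched with simulated data. The following is a minimal illustration, not an analysis of real data: the temperature range, the coefficients, the noise levels and the number of periods are all invented, and temperature is simply assumed to drive both series. It shows ice-cream sales correlating with drownings, that correlation largely disappearing once temperature is accounted for, and temperature being the better single predictor.

```python
import math
import random


def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def residuals(ys, xs):
    """Residuals from a simple least-squares regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


rng = random.Random(1)  # fixed seed so the sketch is reproducible
n = 240  # hypothetical number of periods

# Temperature is the assumed common driver; every number below is invented.
temperature = [rng.uniform(10, 35) for _ in range(n)]
ice_cream_sales = [20 * t + rng.gauss(0, 150) for t in temperature]
drownings = [2 + 0.3 * t + rng.gauss(0, 1.5) for t in temperature]

# Raw correlation: ice-cream sales appear to "predict" drownings.
r_raw = correlation(ice_cream_sales, drownings)

# Remove the temperature effect from both series, then correlate what is left.
r_partial = correlation(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)

# Residual sum of squares when drownings are modelled by each predictor alone.
rss_ice = sum(e ** 2 for e in residuals(drownings, ice_cream_sales))
rss_temp = sum(e ** 2 for e in residuals(drownings, temperature))

print(f"corr(ice-cream sales, drownings):        {r_raw:.2f}")
print(f"same, after removing temperature effect: {r_partial:.2f}")
print(f"RSS using ice-cream sales: {rss_ice:.1f}")
print(f"RSS using temperature:     {rss_temp:.1f}")
```

In this simulated setting ice-cream sales remain a usable predictor, which is the point made above, but the predictor closer to the mechanism leaves a smaller residual sum of squares.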

Confounded predictors

A related issue involves confounding variables. Suppose we are forecasting monthly sales of a company for 2012, using data from 2000–2011. In January 2008 a new competitor came into the market and started taking some market share. At the same time, the economy began to decline.

In your forecasting model, you include both competitor activity (measured using advertising time on a local television station) and the health of the economy (measured using GDP). It will not be possible to separate the effects of these two predictors because they are correlated. We say two variables are confounded when their effects on the forecast variable cannot be separated.

Any pair of correlated predictors will have some level of confounding, but we would not normally describe them as confounded unless there was a relatively high level of correlation between them.

Confounding is not really a problem for forecasting, as we can still compute forecasts without needing to separate out the effects of the predictors. However, it becomes a problem with scenario forecasting as the scenarios should take account of the relationships between predictors. It is also a problem if some historical analysis of the contributions of various predictors is required.

Multicollinearity and forecasting

A closely related issue is multicollinearity, which occurs when similar information is provided by two or more of the predictor variables in a multiple regression. It can occur in a number of ways.

• Two predictors are highly correlated with each other (that is, they have a correlation coefficient close to +1 or -1). In this case, knowing the value of one of the variables tells you a lot about the value of the other variable. Hence, they are providing similar information.

• A linear combination of predictors is highly correlated with another linear combination of predictors. In this case, knowing the value of the first group of predictors tells you a lot about the value of the second group of predictors. Hence, they are providing similar information.

The dummy variable trap is a special case of multicollinearity. Suppose you have quarterly data and use four dummy variables, D₁, D₂, D₃ and D₄. Then D₄ = 1 − D₁ − D₂ − D₃, so there is perfect correlation between D₄ and D₁ + D₂ + D₃.
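
The dummy variable trap can be checked directly on a small design matrix. The sketch below is illustrative only: the helper names `design_matrix` and `matrix_rank` are made up for this example, and twelve quarterly observations are assumed. It builds the columns for an intercept plus four quarterly dummies, then for an intercept plus three, and compares the number of columns with the rank of the matrix.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix via Gaussian elimination in exact rational arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below position `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # this column adds no new information
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design_matrix(n_obs, n_dummies):
    """Rows of [intercept, D1, ..., D_n_dummies] for quarterly observations."""
    rows = []
    for t in range(n_obs):
        quarter = t % 4  # 0 for Q1, ..., 3 for Q4
        dummies = [1 if quarter == j else 0 for j in range(n_dummies)]
        rows.append([1] + dummies)
    return rows


trap = design_matrix(n_obs=12, n_dummies=4)  # intercept + four dummies
safe = design_matrix(n_obs=12, n_dummies=3)  # intercept + three dummies

# In the trap, the four dummies always sum to the intercept column.
print(all(sum(row[1:]) == row[0] for row in trap))  # True

print(len(trap[0]), matrix_rank(trap))  # 5 4
print(len(safe[0]), matrix_rank(safe))  # 4 4
```

With four dummies and an intercept there are five columns but only four independent ones, so the coefficients cannot be uniquely determined. Dropping one dummy restores full column rank.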

When multicollinearity occurs in a multiple regression model, there are several consequences that you need to be aware of.

1. If there is perfect correlation (i.e., a correlation of +1 or -1, such as in the dummy variable trap), it is not possible to estimate the regression model.

2. If there is high correlation (close to but not equal to +1 or -1), then the estimation of the regression coefficients is computationally difficult. In fact, some software (notably Microsoft Excel) may give highly inaccurate estimates of the coefficients. Most reputable statistical software, including the major packages such as R, SPSS, SAS and Stata, uses estimation algorithms that limit the effect of multicollinearity on the coefficient estimates as much as possible, but you do need to be careful.

3. The uncertainty associated with individual regression coefficients will be large. This is because they are difficult to estimate. Consequently, statistical tests (e.g., t-tests) on regression coefficients are unreliable. (In forecasting we are rarely interested in such tests and they have not been discussed in this book.) Also, it will not be possible to make accurate statements about the contribution of each separate predictor to the forecast.

4. Forecasts will be unreliable if the values of the future predictors are outside the range of the historical values of the predictors. For example, suppose you have fitted a regression model with predictors X and Z which are highly correlated with each other, and suppose that the values of X in the fitting data ranged between 0 and 100. Then forecasts based on X > 100 or X < 0 will be unreliable. It is always a little dangerous when future values of the predictors lie much outside the historical range, but it is especially problematic when multicollinearity is present.

Note that if you are using good statistical software, if you are not interested in the specific contributions of each predictor, and if the future values of your predictor variables are within their historical ranges, there is nothing to worry about – multicollinearity is not a problem.
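
Consequences 3 and 4 can be illustrated with a small simulation. This is a sketch under invented assumptions rather than a general result: X is drawn between 0 and 100, Z is X plus a little noise, the true coefficients are both 1, and the error standard deviation is 5. The helpers `solve` and `ols` are written here for the example and solve the normal equations directly, which is adequate for this small problem but is not how statistical packages estimate regressions. The model is refitted on many simulated data sets to see how much the estimates and forecasts move around.

```python
import random
import statistics


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(rows, ys):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)


rng = random.Random(42)  # fixed seed so the sketch is reproducible
n, replications = 50, 200

coef_x, coef_sum, fc_inside, fc_outside = [], [], [], []
for _ in range(replications):
    x = [rng.uniform(0, 100) for _ in range(n)]
    z = [xi + rng.gauss(0, 1) for xi in x]  # z is almost a copy of x
    y = [xi + zi + rng.gauss(0, 5) for xi, zi in zip(x, z)]
    b0, bx, bz = ols([[1.0, xi, zi] for xi, zi in zip(x, z)], y)
    coef_x.append(bx)
    coef_sum.append(bx + bz)
    fc_inside.append(b0 + bx * 50 + bz * 50)    # X and Z inside their historical range
    fc_outside.append(b0 + bx * 120 + bz * 100)  # X outside the range it was fitted on

sd_coef_x = statistics.stdev(coef_x)
sd_coef_sum = statistics.stdev(coef_sum)
sd_inside = statistics.stdev(fc_inside)
sd_outside = statistics.stdev(fc_outside)

print(f"sd of the coefficient on X:       {sd_coef_x:.3f}")
print(f"sd of the sum of coefficients:    {sd_coef_sum:.3f}")
print(f"sd of forecast at X=50,  Z=50:    {sd_inside:.3f}")
print(f"sd of forecast at X=120, Z=100:   {sd_outside:.3f}")
```

In this setup the individual coefficient on X varies far more across data sets than the combined effect of X and Z does, and the forecast at X = 120 varies far more than the forecast inside the historical range.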

5.8 Exercises

1. The data below (data set fancy) concern the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.

         1987      1988      1989      1990      1991      1992      1993
Jan   1664.81   2499.81   4717.02   5921.10   4826.64   7615.03  10243.24
Feb   2397.53   5198.24   5702.63   5814.58   6470.23   9849.69  11266.88
Mar   2840.71   7225.14   9957.58  12421.25   9638.77  14558.40  21826.84
Apr   3547.29   4806.03   5304.78   6369.77   8821.17  11587.33  17357.33
May   3752.96   5900.88   6492.43   7609.12   8722.37   9332.56  15997.79
Jun   3714.74   4951.34   6630.80   7224.75  10209.48  13082.09  18601.53
Jul   4349.61   6179.12   7349.62   8121.22  11276.55  16732.78  26155.15
Aug   3566.34   4752.15   8176.62   7979.25  12552.22  19888.61  28586.52
Sep   5021.82   5496.43   8573.17   8093.06  11637.39  23933.38  30505.41
Oct   6423.48   5835.10   9690.50   8476.70  13606.89  25391.35  30821.33
Nov   7600.60  12600.08  15151.84  17914.66  21822.11  36024.80  46634.38
Dec  19756.21  28541.72  34061.01  30114.41  45060.69  80721.71 104660.67

(a) Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.

(b) Explain why it is necessary to take logarithms of these data before fitting a model.

(c) Use R to fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a “surfing festival” dummy variable.

(d) Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?

(e) Do boxplots of the residuals for each month. Does this reveal any problems with the model?

(f) What do the values of the coefficients tell you about each variable?

(g) What does the Durbin-Watson statistic tell you about your model?

(h) Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.

(i) Transform your predictions and intervals to obtain predictions and intervals for the raw data.

(j) How could you improve these predictions by modifying the model?

2. The data below (data set texasgas) shows the demand for natural gas and the price of natural gas for 20 towns in Texas in 1969.

City Average price P Consumption per customer C

(a) Do a scatterplot of consumption against price. The data are clearly not linear. Three possible nonlinear models for the data are given below.

$$C_i = \exp(a + bP_i + e_i)$$

$$C_i = \begin{cases} a_1 + b_1 P_i + e_i & \text{when } P_i \le 60 \\ a_2 + b_2 P_i + e_i & \text{when } P_i > 60 \end{cases}$$

$$C_i = a + b_1 P + b_2 P^2.$$

The second model divides the data into two sections, depending on whether the price is above or below 60 cents per 1,000 cubic feet.

(b) Can you explain why the slope of the fitted line should change with P?

(c) Fit the three models and find the coefficients, and residual variance in each case.

For the second model, the parameters a₁, a₂, b₁, b₂ can be estimated by simply fitting a regression with four regressors but no constant: (i) a dummy taking value 1 when P ≤ 60 and 0 otherwise; (ii) P₁ = P when P ≤ 60 and 0 otherwise; (iii) a dummy taking value 0 when P ≤ 60 and 1 otherwise; (iv) P₂ = P when P > 60 and 0 otherwise.

(d) For each model, find the value of R² and AIC, and produce a residual plot. Comment on the adequacy of the three models.

(e) For prices 40, 60, 80, 100, and 120 cents per 1,000 cubic feet, compute the forecasted per capita demand using the best model of the three above.

(f) Compute 95% prediction intervals. Make a graph of these prediction intervals and discuss their interpretation.

(g) What is the correlation between P and P²? Does this suggest any general problem to be considered in dealing with polynomial regressions—especially of higher orders?
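
The four-regressor construction described for the second model in Exercise 2 can be sketched as follows. The prices and consumption values are hypothetical and noise-free, chosen so that the true parameters are known; they are not the texasgas data. The helper names `piecewise_row`, `solve` and `ols` are invented for this illustration, and the exercise itself asks for R.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(rows, ys):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)


THRESHOLD = 60.0  # cents per 1,000 cubic feet, as in the second model


def piecewise_row(price):
    """The four regressors (i)-(iv) for one observation; there is no constant column."""
    low = price <= THRESHOLD
    return [
        1.0 if low else 0.0,    # (i)   dummy for P <= 60
        price if low else 0.0,  # (ii)  P1
        0.0 if low else 1.0,    # (iii) dummy for P > 60
        0.0 if low else price,  # (iv)  P2
    ]


# Hypothetical data: C = 100 - P below the threshold, C = 70 - 0.5 P above it.
prices = [30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
consumption = [100 - p if p <= THRESHOLD else 70 - 0.5 * p for p in prices]

a1, b1, a2, b2 = ols([piecewise_row(p) for p in prices], consumption)
print(f"a1={a1:.3f}, b1={b1:.3f}, a2={a2:.3f}, b2={b2:.3f}")
```

Because each observation has non-zero values in only one pair of regressors, the single regression recovers a separate intercept and slope for each price segment.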

5.9 Further reading

Regression with cross-sectional data

• Chatterjee, S. and A. S. Hadi (2012). Regression analysis by example. 5th ed. New York: John Wiley & Sons.

• Fox, J. and H. S. Weisberg (2010). An R Companion to Applied Regression. SAGE Publications, Inc.

• Harrell, Jr, F. E. (2001). Regression modelling strategies: with applications to linear models, logistic regression, and survival analysis. New York: Springer.

• Pardoe, I. (2006). Applied regression modeling: a business approach. Hoboken, NJ: John Wiley & Sons.

• Sheather, S. J. (2009). A modern approach to regression with R. New York: Springer.

Regression with time series data

• Shumway, R. H. and D. S. Stoffer (2011). Time series analysis and its applications: with R examples. 3rd ed. New York: Springer.

In document Forecasting: principles and practice (Page 103-109)