2 The forecaster’s toolbox 13
2.2 Numerical data summaries
Numerical summaries of data sets are widely used to capture some essential features of the data with a few numbers. A summary number calculated from the data is called a statistic.
Univariate statistics
For a single data set, the most widely used statistics are the average and median.
Suppose N denotes the total number of observations and $x_i$ denotes the ith observation. Then the average can be written as¹
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = (x_1 + x_2 + x_3 + \dots + x_N)/N.$$
The average is also called the sample mean.
By way of illustration, consider the carbon footprint from the 20 vehicles listed in Section 1.4.
The data listed in order are
4.0 4.4 5.9 5.9 6.1 6.1 6.1 6.3 6.3 6.3 6.6 6.6 6.6 6.6 6.6 6.6 6.6 6.8 6.8 6.8
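In this example, N = 20 and $x_i$ denotes the carbon footprint of vehicle i. Then the average carbon footprint is
$$\bar{x} = \frac{1}{20}\sum_{i=1}^{20} x_i = (4.0 + 4.4 + 5.9 + \dots + 6.8 + 6.8)/20 = 124/20 = 6.2.$$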
¹ The $\sum$ indicates that the values of $x_i$ are to be summed from i = 1 to i = N.
Forecasting: principles and practice 19
The median, on the other hand, is the middle observation when the data are placed in order. In this case, there are 20 observations, so the median is the average of the 10th and 11th ordered observations. That is,
median = (6.3 + 6.6)/2 = 6.45.
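As a minimal sketch, the average and median of the 20 carbon footprint values can be checked in R. The vector name `carbon` is an assumption made for illustration; `mean()` and `median()` are base R functions.

```r
# Carbon footprint of the 20 vehicles, as listed in the text (already sorted).
carbon <- c(4.0, 4.4, 5.9, 5.9, 6.1, 6.1, 6.1, 6.3, 6.3, 6.3,
            6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.8, 6.8, 6.8)

N <- length(carbon)             # number of observations
avg_manual <- sum(carbon) / N   # (x1 + x2 + ... + xN) / N
avg_builtin <- mean(carbon)

# With an even N, the median averages the two middle ordered values.
sorted <- sort(carbon)
med_manual <- (sorted[N / 2] + sorted[N / 2 + 1]) / 2
med_builtin <- median(carbon)

print(c(mean = avg_builtin, median = med_builtin))
```

The manual and built-in versions should agree; the manual ones are included only to mirror the definitions in the text.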
Percentiles are useful for describing the distribution of data. For example, 90% of the data are no larger than the 90th percentile. In the carbon footprint example, the 90th percentile is 6.8 because the 18th of the 20 ordered observations (90% of the way through the data) is 6.8. Similarly, the 75th percentile is 6.6 and the 25th percentile is 6.1. The median is the 50th percentile.
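The percentiles quoted for the carbon footprint data can be reproduced with a short sketch. It assumes base R's `quantile()` with its default interpolation rule (type 7); other percentile definitions can give slightly different values on other data sets. The name `carbon` is introduced here for illustration.

```r
# Carbon footprint of the 20 vehicles, as listed in the text.
carbon <- c(4.0, 4.4, 5.9, 5.9, 6.1, 6.1, 6.1, 6.3, 6.3, 6.3,
            6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.8, 6.8, 6.8)

# 25th, 50th (median), 75th and 90th percentiles.
# quantile() interpolates between ordered observations by default (type = 7).
p <- quantile(carbon, probs = c(0.25, 0.50, 0.75, 0.90))
print(p)
```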
A useful measure of how spread out the data are is the interquartile range or IQR. This is simply the difference between the 75th and 25th percentiles. Thus it contains the middle 50% of the data. For the example,
$$\text{IQR} = (6.6 - 6.1) = 0.5.$$
An alternative and more common measure of spread is the standard deviation. This is given by the formula
$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2}.$$
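Both measures of spread can be sketched in R for the carbon footprint data. The standard deviation below uses the N − 1 divisor, which is an assumption here (it is the convention followed by R's `sd()`); the vector name `carbon` is also introduced for illustration.

```r
# Carbon footprint of the 20 vehicles, as listed in the text.
carbon <- c(4.0, 4.4, 5.9, 5.9, 6.1, 6.1, 6.1, 6.3, 6.3, 6.3,
            6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.8, 6.8, 6.8)

# Interquartile range: 75th percentile minus 25th percentile.
q <- quantile(carbon, probs = c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])
iqr_builtin <- IQR(carbon)

# Sample standard deviation with the N - 1 divisor (assumed convention).
N <- length(carbon)
sd_manual <- sqrt(sum((carbon - mean(carbon))^2) / (N - 1))
sd_builtin <- sd(carbon)

print(c(IQR = iqr_builtin, sd = sd_builtin))
```

The two outlying small values (4.0 and 4.4) enter the standard deviation directly but have no effect on the IQR, which is one reason to look at both.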
Bivariate statistics
The most commonly used bivariate statistic is the correlation coefficient. It measures the strength of the relationship between two variables and can be written as
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2}},$$
where the first variable is denoted by x and the second variable by y. The correlation coefficient only measures the strength of the linear relationship; it is possible for two variables to have a strong non-linear relationship but a low correlation coefficient. The value of r always lies between -1 and 1, with negative values indicating a negative relationship and positive values indicating a positive relationship.
For example, the correlation between the carbon footprint and city mpg variables shown in Figure 2.7 is -0.97. The value is negative because the carbon footprint decreases as the city mpg increases. While a value of -0.97 is very high, the relationship is even stronger than that number suggests due to its nonlinear nature.
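The point that correlation captures only linear association can be illustrated with a small sketch. The helper name `pearson_r` and both data sets are made up for illustration; they are not the carbon footprint and city mpg data.

```r
# Correlation coefficient written out from its usual definition.
# `pearson_r` is a hypothetical helper name, not from the text.
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))
}

# Made-up data: an exact quadratic relationship, symmetric about zero.
x <- -10:10
y <- x^2
r_nonlinear <- pearson_r(x, y)

# Made-up data: an exact decreasing linear relationship.
z <- 5 - 2 * x
r_linear <- pearson_r(x, z)

print(c(nonlinear = r_nonlinear, linear = r_linear))
```

Here y is completely determined by x, yet its correlation with x is essentially zero, while the decreasing straight line gives a correlation of -1. The built-in `cor()` should return the same values as the helper.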
Figure 2.7: Examples of data sets with different levels of correlation.
The graphs in Figure 2.7 show examples of data sets with varying levels of correlation. Those in Figure 2.8 all have correlation coefficients of 0.82, but the relationships have very different shapes.
This shows how important it is not to rely only on correlation coefficients but also to look at the plots of the data.
Figure 2.8: Each of these plots has a correlation coefficient of 0.82. Data from Anscombe F. J. (1973) Graphs in statistical analysis. American Statistician, 27, 17–21.
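A short sketch can check the equal-correlation claim, assuming that the `anscombe` data frame shipped with base R (columns `x1` to `x4` and `y1` to `y4`) holds the same four data sets as the figure.

```r
# `anscombe` is in base R's datasets package: columns x1..x4 and y1..y4.
r <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(r) <- paste("set", 1:4)

# Each coefficient rounds to 0.82 despite the different shapes.
print(round(r, 2))

# Plotting the four pairs side by side shows how different they are:
# par(mfrow = c(2, 2))
# for (i in 1:4) plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
```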
Autocorrelation
Just as correlation measures the extent of a linear relationship between two variables, autocorrelation measures the linear relationship between lagged values of a time series. There are several autocorrelation coefficients, depending on the lag length. For example, $r_1$ measures the relationship between $y_t$ and $y_{t-1}$, $r_2$ measures the relationship between $y_t$ and $y_{t-2}$, and so on.
Figure 2.9 displays scatterplots of the beer production time series where the horizontal axis shows lagged values of the time series. Each graph shows yt plotted against yt−k for different values of k. The autocorrelations are the correlations associated with these scatterplots.
Figure 2.9: Lagged scatterplots for quarterly beer production.
Listing 2.8: R code
library(fpp)  # assumed source of the ausbeer data
beer2 <- window(ausbeer, start=1992, end=2006-.1)  # 1992 Q1 to 2005 Q4
lag.plot(beer2, lags=9, do.lines=FALSE)  # y_t against y_{t-k}, k = 1..9
The value of $r_k$ can be written as
$$r_k = \frac{\sum_{t=k+1}^{T} (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^{T} (y_t - \bar{y})^2},$$
where T is the length of the time series.
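The formula for the lag-k autocorrelation can be sketched directly and compared with base R's `acf()`. The helper name `autocorr` is hypothetical, and the series below is simulated for illustration; it is not the beer production data.

```r
# `autocorr` is a hypothetical helper that follows the formula for r_k.
autocorr <- function(y, k) {
  T_len <- length(y)
  y_bar <- mean(y)
  # Numerator: sum over t = k+1, ..., T of (y_t - y_bar) * (y_{t-k} - y_bar)
  num <- sum((y[(k + 1):T_len] - y_bar) * (y[1:(T_len - k)] - y_bar))
  # Denominator: sum over t = 1, ..., T of (y_t - y_bar)^2
  den <- sum((y - y_bar)^2)
  num / den
}

# Made-up quarterly series: a repeating seasonal pattern plus a little noise.
set.seed(1)
y <- ts(rep(c(10, 4, 6, 12), times = 10) + rnorm(40, sd = 0.5), frequency = 4)

r_manual <- sapply(1:9, function(k) autocorr(as.numeric(y), k))

# acf() returns lag 0 first, so elements 2 to 10 are lags 1 to 9.
r_builtin <- acf(y, lag.max = 9, plot = FALSE)$acf[2:10]

print(round(r_manual, 3))
```

Because the simulated pattern repeats every four quarters, the lag-4 coefficient should come out strongly positive, in the same way as the seasonal behaviour described for quarterly data.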
The first nine autocorrelation coefficients for the beer production data are given in the following table.
r1      r2      r3      r4     r5      r6      r7      r8     r9
-0.126  -0.650  -0.094  0.863  -0.099  -0.642  -0.098  0.834  -0.116
These correspond to the nine scatterplots in the graph above. The autocorrelation coefficients are normally plotted to form the autocorrelation function or ACF. The plot is also known as a correlogram.
Listing 2.9: R code
acf(beer2)
In this graph:
• $r_4$ is higher than for the other lags. This is due to the seasonal pattern in the data: the peaks tend to be four quarters apart and the troughs tend to be four quarters apart.
• $r_2$ is more negative than for the other lags because troughs tend to be two quarters behind peaks.
Figure 2.10: Autocorrelation function of quarterly beer production
Figure 2.11: A white noise time series.
White noise
Time series that show no autocorrelation are called "white noise". Figure 2.11 gives an example of a white noise series.
Listing 2.10: R code
set.seed(30)
x <- ts(rnorm(50))
plot(x, main="White noise")
Listing 2.11: R code
Acf(x)  # Acf() is provided by the forecast package
For white noise series, we expect each autocorrelation to be close to zero. Of course, they are not exactly equal to zero as there is some random variation. For a white noise series, we expect 95% of the spikes in the ACF to lie within $\pm 2/\sqrt{T}$, where T is the length of the time series. It is common to plot these bounds on a graph of the ACF. If there are one or more large spikes outside these bounds, or if more than 5% of spikes are outside these bounds, then the series is probably not white noise.
Figure 2.12: Autocorrelation function for the white noise series.
In this example, T = 50 and so the bounds are at $\pm 2/\sqrt{50} = \pm 0.28$. All autocorrelation coefficients lie within these limits, which is consistent with the data being white noise.
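The white noise check can be sketched in base R as follows. It reuses the simulation settings from the listing above; the names `bound` and `n_outside` are introduced here for illustration, and the result for other seeds or series will differ.

```r
set.seed(30)
x <- ts(rnorm(50))

T_len <- length(x)
bound <- 2 / sqrt(T_len)    # approximate 95% limits are +/- bound

# Sample autocorrelations at lags 1, 2, ... (lag 0 is dropped).
r <- acf(x, plot = FALSE)$acf[-1]

# Count the spikes that fall outside the limits.
n_outside <- sum(abs(r) > bound)

print(round(bound, 2))
print(n_outside / length(r))   # share of spikes outside the limits
```

A share well above 5%, or a single very large spike, would suggest the series is not white noise. Note that the limits drawn by R's ACF plot use 1.96 rather than 2 in the numerator, so they are marginally narrower than the rule of thumb used here.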