2 The forecaster’s toolbox

2.2 Numerical data summaries

Numerical summaries of data sets are widely used to capture some essential features of the data with a few numbers. A summary number calculated from the data is called a statistic.

Univariate statistics

For a single data set, the most widely used statistics are the average and median.

Suppose $N$ denotes the total number of observations and $x_i$ denotes the $i$th observation. Then the average can be written as¹

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = (x_1 + x_2 + x_3 + \cdots + x_N)/N.$$

The average is also called the sample mean.

By way of illustration, consider the carbon footprint from the 20 vehicles listed in Section 1.4.

The data listed in order are

4.0 4.4 5.9 5.9 6.1 6.1 6.1 6.3 6.3 6.3 6.6 6.6 6.6 6.6 6.6 6.6 6.6 6.8 6.8 6.8

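In this example, $N = 20$ and $x_i$ denotes the carbon footprint of vehicle $i$. Then the average carbon footprint is

$$\bar{x} = (4.0 + 4.4 + 5.9 + \cdots + 6.8 + 6.8)/20 = 124/20 = 6.2.$$

¹ The symbol $\sum$ indicates that the values of $x_i$ are to be summed from $i = 1$ to $i = N$.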

Forecasting: principles and practice 19

The median, on the other hand, is the middle observation when the data are placed in order. In this case, there are 20 observations and so the median is the average of the 10th and 11th largest observations. That is,

median = (6.3 + 6.6)/2 = 6.45.
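The average and median above can be checked in R. The following is a minimal sketch; the vector name `carbon` is an assumption introduced here for illustration, and its values are simply the 20 carbon footprints listed earlier.

```r
# Carbon footprints of the 20 vehicles, typed in from the ordered list above.
# The name `carbon` is illustrative and does not come from the original text.
carbon <- c(4.0, 4.4, 5.9, 5.9, 6.1, 6.1, 6.1, 6.3, 6.3, 6.3,
            6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.8, 6.8, 6.8)

N   <- length(carbon)      # number of observations
avg <- sum(carbon) / N     # the average written out as (sum of values) / N
med <- median(carbon)      # with N even, averages the two middle ordered values

print(c(N = N, average = avg, median = med))
```

Writing the average as `sum(carbon) / N` mirrors the formula; in practice `mean(carbon)` gives the same result.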

Percentiles are useful for describing the distribution of data. For example, 90% of the data are no larger than the 90th percentile. In the carbon footprint example, the 90th percentile is 6.8 because 6.8 is the smallest value with at least 90% of the data (18 observations) less than or equal to it. Similarly, the 75th percentile is 6.6 and the 25th percentile is 6.1. The median is the 50th percentile.

A useful measure of how spread out the data are is the interquartile range or IQR. This is simply the difference between the 75th and 25th percentiles. Thus it contains the middle 50% of the data. For the example,

IQR = (6.6 − 6.1) = 0.5.
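The percentiles and the IQR quoted above can be reproduced with a short R sketch. Percentile definitions differ slightly between textbooks and software; the code below assumes R's default `quantile()` method (type 7, linear interpolation between ordered values), which agrees with the figures in the text for this data set. The vector name `carbon` is an illustrative assumption.

```r
# Carbon footprints of the 20 vehicles (values from the ordered list in the text).
carbon <- c(4.0, 4.4, 5.9, 5.9, 6.1, 6.1, 6.1, 6.3, 6.3, 6.3,
            6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.8, 6.8, 6.8)

# 25th, 50th, 75th and 90th percentiles using quantile()'s default (type = 7).
pct <- quantile(carbon, probs = c(0.25, 0.50, 0.75, 0.90))

# Interquartile range: 75th percentile minus 25th percentile.
iqr <- IQR(carbon)

print(pct)
print(iqr)
```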

An alternative and more common measure of spread is the standard deviation. This is given by the formula

$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2}.$$
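As a rough check on the carbon footprint data, the sketch below computes the sample standard deviation by hand and compares it with R's built-in `sd()`. It assumes the usual sample version with divisor $N - 1$ (the convention `sd()` uses); the names `carbon`, `s_manual` and `s_builtin` are illustrative only.

```r
# Carbon footprints of the 20 vehicles (values from the ordered list in the text).
carbon <- c(4.0, 4.4, 5.9, 5.9, 6.1, 6.1, 6.1, 6.3, 6.3, 6.3,
            6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.8, 6.8, 6.8)

N    <- length(carbon)
xbar <- mean(carbon)

# Square root of the sum of squared deviations divided by N - 1 (assumed divisor).
s_manual  <- sqrt(sum((carbon - xbar)^2) / (N - 1))
s_builtin <- sd(carbon)    # built-in sample standard deviation

print(round(c(manual = s_manual, builtin = s_builtin), 2))
```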

Bivariate statistics

The most commonly used bivariate statistic is the correlation coefficient. It measures the strength of the relationship between two variables and can be written as

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2}},$$

where the first variable is denoted by $x$ and the second variable by $y$. The correlation coefficient only measures the strength of the linear relationship; it is possible for two variables to have a strong non-linear relationship but a low correlation coefficient. The value of $r$ always lies between -1 and 1, with negative values indicating a negative relationship and positive values indicating a positive relationship.

For example, the correlation between the carbon footprint and city mpg variables shown in Figure 2.7 is -0.97. The value is negative because the carbon footprint decreases as the city mpg increases. While a value of -0.97 is very high, the relationship is even stronger than that number suggests due to its nonlinear nature.
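To illustrate how a correlation coefficient can understate a non-linear relationship, here is a small sketch using made-up numbers (they are not the vehicle data, which are not reproduced in this section). In both cases the second variable is determined exactly by the first, yet the correlation is not ±1.

```r
# Case 1: a perfect quadratic relationship on a symmetric range.
x1 <- -10:10
y1 <- x1^2
r1 <- cor(x1, y1)      # essentially zero: no linear trend at all

# Case 2: a perfect but curved decreasing relationship (y proportional to 1/x),
# loosely analogous to a quantity that falls as mpg rises.
x2 <- 10:40
y2 <- 100 / x2
r2 <- cor(x2, y2)      # negative, but strictly between -1 and 0

print(c(quadratic = r1, reciprocal = r2))
```

In the second case the relationship is exact, so the gap between the coefficient and -1 comes entirely from curvature rather than from noise.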

Figure 2.7: Examples of data sets with different levels of correlation.

The graphs in Figure 2.7 show examples of data sets with varying levels of correlation. Those in Figure 2.8 all have correlation coefficients of 0.82, but they have very differently shaped relationships.

This shows how important it is not to rely only on correlation coefficients but also to look at the plots of the data.

Figure 2.8: Each of these plots has a correlation coefficient of 0.82. Data from Anscombe F. J. (1973) Graphs in statistical analysis. American Statistician, 27, 17–21.
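The claim about the four Anscombe data sets can be checked directly, since they ship with base R as the data frame `anscombe` (columns `x1` to `x4` and `y1` to `y4`). The sketch below assumes these are the same four data sets plotted in Figure 2.8; the helper `correlation()` is a hypothetical name that simply writes out the usual formula for $r$, and it is compared with the built-in `cor()`.

```r
# Anscombe's quartet from the datasets package that ships with R.
data(anscombe)

# Correlation coefficient written out from its definition (illustrative helper).
correlation <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))
}

# One coefficient per (x, y) pair in the quartet.
r_manual <- sapply(1:4, function(j) {
  correlation(anscombe[[paste0("x", j)]], anscombe[[paste0("y", j)]])
})
r_builtin <- sapply(1:4, function(j) {
  cor(anscombe[[paste0("x", j)]], anscombe[[paste0("y", j)]])
})

print(round(r_manual, 2))
```

Plotting each pair, for example with `plot(anscombe$x2, anscombe$y2)`, shows how differently the four relationships are shaped despite the near-identical coefficients.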

Autocorrelation

Just as correlation measures the extent of a linear relationship between two variables, autocorrelation measures the linear relationship between lagged values of a time series. There are several autocorrelation coefficients, depending on the lag length. For example, $r_1$ measures the relationship between $y_t$ and $y_{t-1}$, $r_2$ measures the relationship between $y_t$ and $y_{t-2}$, and so on.

Figure 2.9 displays scatterplots of the beer production time series where the horizontal axis shows lagged values of the time series. Each graph shows $y_t$ plotted against $y_{t-k}$ for different values of $k$. The autocorrelations are the correlations associated with these scatterplots.

Figure 2.9: Lagged scatterplots for quarterly beer production.

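Listing 2.8: R code
# Quarterly beer production from 1992 up to (but not including) 2006
beer2 <- window(ausbeer, start=1992, end=2006-.1)
# Scatterplots of y_t against y_{t-k} for k = 1, ..., 9
lag.plot(beer2, lags=9, do.lines=FALSE)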

The value of $r_k$ can be written as

$$r_k = \frac{\sum_{t=k+1}^{T} (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^{T} (y_t - \bar{y})^2},$$

where $T$ is the length of the time series.
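The formula for $r_k$ translates almost line by line into R. The sketch below is illustrative only: because the beer data require an add-on package, it uses `UKgas`, a quarterly series that ships with base R, as a stand-in, and the function name `autocorr()` is hypothetical. The result is compared with the built-in `acf()`, which uses the same definition.

```r
# Stand-in quarterly series from base R (not the beer data used in the text).
y <- as.numeric(UKgas)

# Lag-k autocorrelation written out from the formula for r_k.
autocorr <- function(y, k) {
  T_len <- length(y)
  ybar  <- mean(y)
  # Numerator: sum over t = k+1, ..., T of (y_t - ybar) * (y_{t-k} - ybar)
  num <- sum((y[(k + 1):T_len] - ybar) * (y[1:(T_len - k)] - ybar))
  # Denominator: sum over all t of (y_t - ybar)^2
  den <- sum((y - ybar)^2)
  num / den
}

r_manual  <- sapply(1:9, function(k) autocorr(y, k))
# acf() returns lag 0 first, so drop it to keep lags 1 to 9.
r_builtin <- as.numeric(acf(y, lag.max = 9, plot = FALSE)$acf)[-1]

print(round(r_manual, 3))
```

Note that the denominator uses all $T$ observations while the numerator has only $T - k$ terms, so $r_k$ is not exactly the ordinary correlation between $y_t$ and $y_{t-k}$.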

The first nine autocorrelation coefficients for the beer production data are given in the following table.

r1      r2      r3      r4     r5      r6      r7      r8     r9
-0.126  -0.650  -0.094  0.863  -0.099  -0.642  -0.098  0.834  -0.116

These correspond to the nine scatterplots in the graph above. The autocorrelation coefficients are normally plotted to form the autocorrelation function or ACF. The plot is also known as a correlogram.

Listing 2.9: R code
acf(beer2)

In this graph:

• $r_4$ is higher than for the other lags. This is due to the seasonal pattern in the data: the peaks tend to be four quarters apart and the troughs tend to be four quarters apart.

• $r_2$ is more negative than for the other lags because troughs tend to be two quarters behind peaks.

Figure 2.10: Autocorrelation function of quarterly beer production

Figure 2.11: A white noise time series.

White noise

Time series that show no autocorrelation are called "white noise". Figure 2.11 gives an example of a white noise series.

Listing 2.10: R code
set.seed(30)
x <- ts(rnorm(50))
plot(x, main="White noise")

Listing 2.11: R code
library(forecast)  # provides Acf()
Acf(x)

For white noise series, we expect each autocorrelation to be close to zero. Of course, they are not exactly equal to zero as there is some random variation. For a white noise series, we expect 95% of the spikes in the ACF to lie within $\pm 2/\sqrt{T}$, where $T$ is the length of the time series. It is common to plot these bounds on a graph of the ACF. If there are one or more large spikes outside these bounds, or if more than 5% of spikes are outside these bounds, then the series is probably not white noise.

Figure 2.12: Autocorrelation function for the white noise series.

In this example, $T = 50$ and so the bounds are at $\pm 2/\sqrt{50} = \pm 0.28$.

All autocorrelation coefficients lie within these limits, confirming that the data are white noise.
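The same check can be done numerically rather than by eye. The following sketch regenerates the simulated series with the seed used in the listings, computes the $\pm 2/\sqrt{T}$ bounds, and counts how many autocorrelations fall outside them. The choice of 15 lags and the variable names are assumptions made for illustration, and base R's `acf()` is used in place of a plotting call.

```r
# Regenerate the simulated white noise series from the listings.
set.seed(30)
x <- ts(rnorm(50))

T_len <- length(x)
bound <- 2 / sqrt(T_len)          # approximate 95% limits are +/- bound

# Autocorrelations at lags 1 to 15 (the lag count is an illustrative choice);
# acf() returns lag 0 first, so it is dropped.
r <- as.numeric(acf(x, lag.max = 15, plot = FALSE)$acf)[-1]

outside       <- sum(abs(r) > bound)    # number of spikes beyond the limits
share_outside <- outside / length(r)    # proportion of spikes beyond the limits

print(round(bound, 2))
print(outside)
print(share_outside)
```

A share well above 5%, or a single very large spike, would suggest that the series is not white noise.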
