
2.4 Forecasting methods

2.4.1 Robustness to missing data

Missing data. The January/February 2016 period for link 7.8 was chosen because these were the only two consecutive complete months. Including March as well introduces missing data, which by convention are coded as NA. As mentioned in Section 2.3.1, missing data are common with any method of recording travel time. They can be due to sensor failure, bad weather conditions or recording errors.

To check the reliability of the forecast we require a complete day of observations to compare the forecasts with. The last complete day is Friday 25th March, so this day is forecast using the data up to 24th March.
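The selection of the evaluation day described above can be sketched in a few lines. This is only an illustration under assumptions: the function names, the day labels and the toy values are hypothetical, and missing values are represented by NaN or None rather than the NA code used in the text.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing (standing in for the NA code)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def last_complete_day(days):
    """Return the label of the latest day with no missing observations.

    `days` is a chronological list of (label, observations) pairs.
    Returns None if every day has at least one gap.
    """
    for label, observations in reversed(days):
        if not any(is_missing(v) for v in observations):
            return label
    return None

def split_for_forecast(days, target_label):
    """Training data is every day strictly before the target day."""
    labels = [label for label, _ in days]
    cut = labels.index(target_label)
    return days[:cut], days[cut]

# Toy series with four observations per day (made-up values).
NA = float("nan")
toy = [
    ("23 Mar", [1.0, 1.2, 1.1, 1.0]),
    ("24 Mar", [1.0, NA, 1.3, 1.1]),
    ("25 Mar", [0.9, 1.0, 1.0, 0.9]),
    ("26 Mar", [NA, NA, 1.0, 1.0]),
]
target = last_complete_day(toy)
train, test = split_for_forecast(toy, target)
```

Note that the training days may themselves contain gaps; only the day used for comparison has to be complete.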

Bank holiday. Friday 25th March is a bank holiday, which may cause some difference in the data because fewer people drive to work. The plots are shown for the 25th of March only, to demonstrate their effectiveness. The error metrics in Section 2.5.1 therefore also consider other days, starting with the 23rd and the 24th of March. We present the fit and forecast plots for each method in a similar format to the first part of Section 2.4. We focus on a single model for all days, but further work could improve the model for atypical days such as bank holidays, which are known in advance, as opposed to the unpredictable delays we look at in Chapter 3.

Models with missing data

ARIMA model. The ARIMA model is an ARIMA(2,0,0) which doesn’t have any differencing or moving average terms. The coefficients are 1.09 and -0.20 for the AR

CHAPTER 2. EVALUATING SINGLE LINK MODELS 90

terms and 0.37 for the mean. The two biggest spikes on the partial autocorrelation plot are at the first and second time lags, agreeing with the two autoregressive terms. Figure 2.4.9 shows the forecast for the 25th of March. The model forecasts the general shape correctly; however, it is higher than the observed values for the majority of the day.
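To make the role of the reported coefficients concrete, the following minimal sketch iterates the point-forecast recursion of an AR(2) model. It assumes the mean-centred form x_t = μ + φ1(x_{t-1} - μ) + φ2(x_{t-2} - μ) with future noise set to zero; the function name and the starting values are hypothetical, and the sketch does not reproduce how the model was actually fitted.

```python
def ar2_forecast(history, phi1, phi2, mean, steps):
    """Iterate x_t = mean + phi1*(x_{t-1} - mean) + phi2*(x_{t-2} - mean)."""
    if len(history) < 2:
        raise ValueError("need at least two observations")
    prev2, prev1 = history[-2], history[-1]
    forecasts = []
    for _ in range(steps):
        nxt = mean + phi1 * (prev1 - mean) + phi2 * (prev2 - mean)
        forecasts.append(nxt)
        prev2, prev1 = prev1, nxt
    return forecasts

# Coefficients reported in the text; the starting values are made up.
PHI1, PHI2, MEAN = 1.09, -0.20, 0.37
path = ar2_forecast([1.0, 1.0], PHI1, PHI2, MEAN, steps=96)
```

Under this form the forecast decays towards the mean of 0.37 as the horizon grows, so without regression terms the AR part alone cannot reproduce a recurring daily pattern far ahead.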

ARIMA with selected Fourier terms. The ARIMA Fourier select model is an ARIMA(2,0,0) model, like the standardized ARIMA model. Figure 2.4.10 shows that it gives a worse prediction than the ARIMA model. The forecast has a part in the middle, between 8am and 12am, where it is much higher than the observed data, and the overall forecast is too high. The AR and intercept coefficients are very similar to those of the ARIMA model, at 1.08, -0.20 and 0.37. The Fourier coefficients are -0.16, -0.35, 0.37, -0.61, -0.64, 0.01, 0.04 and 0.00.

ARIMA with Fourier terms. The ARIMA model with Fourier terms provides the best forecast for the 25th March. The prediction is still slightly high in parts, and although the travel times fluctuate a lot from one 15-minute interval to the next, the forecast follows the underlying pattern. Figure 2.4.11 shows that using K=1 for the Fourier series provides a good estimate. This is an ARIMA(2,0,0) with AR coefficients of 1.08 and -0.20 and an intercept of 0.37, the same as for the ARIMA with selected Fourier terms. The Fourier coefficients are -0.16 and -0.35.
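The Fourier regressors can be illustrated with a short sketch. Several things here are assumptions rather than statements from the text: the seasonal period is taken to be 96 (one day of 15-minute intervals), the regressors are ordered sine then cosine for each harmonic, and the two reported K=1 coefficients are paired with them in that order. The function names are hypothetical.

```python
import math

def fourier_terms(t, period, K):
    """Return [sin_1, cos_1, ..., sin_K, cos_K] evaluated at time index t."""
    terms = []
    for k in range(1, K + 1):
        angle = 2.0 * math.pi * k * t / period
        terms.append(math.sin(angle))
        terms.append(math.cos(angle))
    return terms

def seasonal_component(t, period, coefficients):
    """Linear combination of Fourier terms; expects 2*K coefficients."""
    K = len(coefficients) // 2
    return sum(c * x for c, x in zip(coefficients, fourier_terms(t, period, K)))

PERIOD = 96                 # assumption: 24 hours of 15-minute intervals
K1_COEFFS = [-0.16, -0.35]  # reported values; sin-then-cos order is assumed
daily_shape = [seasonal_component(t, PERIOD, K1_COEFFS) for t in range(PERIOD)]
```

With K=1 the regression contributes a single smooth daily wave, while K=4 would give the eight regressors listed for the selected-terms model.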

Exponential smoothing. The forecast for exponential smoothing can be seen in Figure 2.4.12. The forecast slightly overestimates travel times across the day, and the model fit slightly overestimates the highest peak. As with the model for January and February, α = 1.
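The effect of α = 1 can be seen in a minimal sketch of simple exponential smoothing. This assumes a model with a level only; the text does not say whether the fitted model also has trend or seasonal components, in which case the forecast would not be flat. The function name is hypothetical.

```python
def ses_forecast(series, alpha, initial_level=None):
    """Simple exponential smoothing: level = alpha*y + (1 - alpha)*level.

    Returns the flat forecast made after the last observation.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    level = series[0] if initial_level is None else initial_level
    for y in series:
        level = alpha * y + (1.0 - alpha) * level
    return level
```

With alpha equal to 1 the level update discards all earlier history, so under this assumption the level component simply tracks the most recent observation.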

(a) ARIMA standardized model forecast. The observed data are black and the predictions are red.

(b) ARIMA standardized model fit. The fitted values are in green while the observed values are in black.

Figure 2.4.9: The fit and forecast for the standardized ARIMA model.

Previous methods. Using the previous values only works if the whole of the previous day had no missing data. If any data are missing it is unclear what should be used instead. One option would be to use the last value that had been observed for that time index before the missing one.
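The option just described, falling back to the last observed value at the same time index, might look like the sketch below. The function names and the toy data are hypothetical, and missing values are represented by NaN or None.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def previous_value_forecast(history):
    """Forecast the next day from earlier days, one time index at a time.

    `history` is a chronological list of days, each a list of equal length.
    For each time index, use the most recent day whose value was observed;
    the slot stays None if no earlier day has a value there.
    """
    n_slots = len(history[-1])
    forecast = []
    for slot in range(n_slots):
        value = None
        for day in reversed(history):
            if not is_missing(day[slot]):
                value = day[slot]
                break
        forecast.append(value)
    return forecast

# Made-up example: the second slot of the previous day is missing.
nan = float("nan")
example = previous_value_forecast([[1.0, 1.1, 1.2], [1.3, nan, 1.4]])
```

Here the second slot falls back to the value from two days earlier, while the other slots use the previous day directly.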


(a) ARIMA standardized forecast with selected Fourier terms. The observed data are black and the predictions are red.

(b) ARIMA standardized fit with selected Fourier terms. The fitted values are in green while the observed values are in black.

Figure 2.4.10: The fit and forecast for the standardized ARIMA model with selected Fourier terms.

but are very spiky. This would be a problem when used as input to the VRP, as it could mean that a vehicle entering the link a minute later would emerge before one that entered earlier.
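This ordering requirement is often called the first-in-first-out (FIFO) property: exit time, that is entry time plus travel time, should not decrease as entry time increases. A minimal check, with a hypothetical function name and made-up numbers in minutes, could look like this:

```python
def fifo_violations(entry_times, travel_times):
    """Return index pairs (i, i + 1) where a later entry exits strictly earlier.

    entry_times must be increasing; both lists use the same time unit.
    """
    exits = [t + tau for t, tau in zip(entry_times, travel_times)]
    return [(i, i + 1) for i in range(len(exits) - 1) if exits[i + 1] < exits[i]]

# Entering at minute 0 takes 10 minutes, entering at minute 1 takes only 8,
# so the second vehicle would emerge at minute 9, before the first.
spiky = fifo_violations([0, 1, 2], [10.0, 8.0, 8.5])
smooth = fifo_violations([0, 1, 2], [10.0, 9.5, 9.0])
```

A forecast whose travel time drops by more than the gap between consecutive entry times triggers a violation, which is why very spiky forecasts are problematic as routing inputs.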


(a) ARIMA standardized Fourier 1 forecast. The observed data are black and the predictions are red.

(b) ARIMA standardized Fourier 1 fit. The fitted values are in green while the observed values are in black.

Figure 2.4.11: The fit and forecast for the standardized ARIMA Fourier 1 model.


(a) Exponential smoothing forecast for 25th Mar. The observed data are black and the predictions are red.

(b) Exponential smoothing fit. The fitted values are in green while the observed values are in black.

Figure 2.4.12: The fit and forecast for the exponential smoothing model.

Summary of methods. The forecasts for the 25th March are much closer to the observed travel time values. Figure 2.4.14 shows them all plotted on one graph. The ARIMA parts of the models are all similar, so the forecasts differ mainly through the regression terms. The previous day values are close to the observed travel times.


Figure 2.4.13: Previous value forecasts for 25th March.

Figure 2.4.14: Forecasts of travel times for 25th March.

To check that the January/February/March model wasn't distorted by 25th March being a bank holiday, the Fourier model with K=1 was run again for Thursday 24th March. As can be seen clearly in Figure 2.4.15, using K=1 for the Fourier series provides a good estimate, with slight overestimation between 7am and 10am. This initial analysis suggests that the best method depends upon the day. A method is unlikely to be the best on every single day but may be the best over multiple days. Classifying days into extra categories and using different models for different days is a potential avenue for further work. The Fourier with K=1 and ARIMA methods were the best on different days. Some of the methods greatly overestimate travel time peaks, which would be an issue in terms of accuracy, while others miss rush hour peaks entirely. This comparison covered only three days, so we now look over a much longer period, necessitating the use of metrics rather than visual inspection to select the optimal method for all links.
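The distinction between the best method on a given day and the best method over many days can be illustrated with a small sketch. Everything here is hypothetical: the text does not say at this point which error metrics Section 2.5.1 uses, so mean absolute error is only an assumed stand-in, and the method names and error values are made up.

```python
def mean_absolute_error(observed, forecast):
    """Average absolute difference between paired observations and forecasts."""
    pairs = list(zip(observed, forecast))
    return sum(abs(o - f) for o, f in pairs) / len(pairs)

def best_method(errors_by_method):
    """Return the method name with the lowest error."""
    return min(errors_by_method, key=errors_by_method.get)

def rank_over_days(daily_errors):
    """daily_errors maps method name to a list of per-day errors.

    Returns (list of per-day winners, overall winner by mean daily error).
    """
    n_days = len(next(iter(daily_errors.values())))
    per_day = [
        best_method({m: errs[d] for m, errs in daily_errors.items()})
        for d in range(n_days)
    ]
    overall = best_method({m: sum(e) / len(e) for m, e in daily_errors.items()})
    return per_day, overall

# Made-up daily errors for three hypothetical methods over three days.
toy_errors = {
    "A": [0.10, 0.40, 0.40],
    "B": [0.20, 0.20, 0.25],
    "C": [0.40, 0.10, 0.20],
}
per_day, overall = rank_over_days(toy_errors)
```

In this toy case method B is never the best on any single day yet has the lowest average error, which is the kind of outcome that visual inspection of a few days cannot reveal.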

In document Modelling and inference for the travel times in vehicle routing problems (Page 102-109)