Reconciliation - Group results - Hierarchical forecasting of engineering demand at KLM Engineer

5.1 Group results

5.1.3 Reconciliation

Until now we have found that our produced forecasts are significantly better than those of the current method and that this can attributed to the inclusion of multiple models. Each node was forecast as accurately as possible with a combination of the applied base methods. By producing individual forecasts for the nodes we have introduced incoherence between the levels of the demand structure. Sections 3.6.2, 4.3.2 and 4.6 address the issue of forecast reconciliation, i.e. aligning all the individuals’ nodes of the demand structure so they aggregate to, and coincide with higher levels and their forecasts in the demand structure. Applying reconciliation to all nodes provides us with a new forecast which we can evaluate on its accuracy. Table 5.3 and Table 5.4 show comparisons for the average MASE per group. Table 5.3 provides a comparison between MASE forecast accuracy of 2017, that of the best forecast, the reconciled forecast and of the current method. Table 5.4 provides a comparison over both years and of the average.

The first thing that becomes apparent is that overall accuracy drastically decreases after reconciliation. Nearly every group loses accuracy in the process and the error increases with lower level series. Some decrease in accuracy was expected but this result is overly negative, the current forecasting approach appears much more accurate and because it forecasts averages it is also reconciled. This appears to demerit reconciliation, but why does it perform so poorly?

Group Best17 Rec17 Current17 Total 1,45 1,44 1,56 G1 0,96 1,09 1,03 G2 0,78 0,86 1,10 G3 0,53 0,54 0,61 G4 1,05 0,98 1,78 G5 0,77 1,02 1,05 G6 0,46 0,66 0,54 G7 0,80 2,24 1,08 G8 1,12 1,00 1,81 G9 0,94 2,06 1,43 G10 0,91 1,37 1,30 G11 1,64 4,65 1,97 G12 0,50 0,56 0,68 G13 0,61 2,39 0,90 G14 0,68 2,40 1,10 BTS 0,88 5,24 1,33

Table 5.3 MASE comparison of reconciled vs best and current in 2017

Table 5.4 MASE comparison reconciled forecast

Inspecting the lower groups makes clear why reconciliation decreases overall performance. In theory borrowing information from all levels is beneficial, but this advantage breaks down in series that have many 0 value observations. Reconciliation ‘projects’ patterns prevalent at higher levels to series that do not exhibit such patterns normally, mainly the series discussed in Section 4.3.3. As a result forecasts that were near faultless are adjusted to be more errors prone. Figure 5.8 shows an example of a ‘perfect’ forecast wrongly adjusted to reflect patterns from higher levels. Both the current method and best forecasts correctly predict that there is no further demand to be expected. Yet reconciliation detects the patterns in the higher levels and adjusts the zero forecasts accordingly, this introduces two big flaws. First, the accuracy as measured by MASE spikes, the average in-sample naive error is low and suddenly large errors appear causing large deviations. Secondly, reconciliation does not care that our demand is represented in hours and therefore has a strict border at 0.

Figure 5.8 Reconciled zero forecast

Group Rec16 Current16 Diff Mase17 Current17 Diff Avg CurrentAvg diff

Total 0,79 1,02 -0,23 1,44 1,56 -0,12 1,12 1,29 -0,17 G1 2,31 3,09 -0,78 1,09 1,03 0,06 1,70 2,06 -0,36 G2 3,45 4,22 -0,77 0,86 1,10 -0,24 2,15 2,66 -0,51 G3 0,50 0,64 -0,15 0,54 0,61 -0,06 0,52 0,62 -0,11 G4 0,65 0,82 -0,17 0,98 1,78 -0,81 0,81 1,30 -0,49 G5 1,20 1,03 0,17 1,02 1,05 -0,03 1,11 1,04 0,07 G6 1,15 1,25 -0,10 0,66 0,54 0,12 0,91 0,90 0,01 G7 2,82 1,67 1,15 2,24 1,08 1,16 2,53 1,37 1,16 G8 1,60 2,22 -0,62 1,00 1,81 -0,81 1,30 2,02 -0,72 G9 2,65 1,17 1,48 2,06 1,43 0,63 2,36 1,30 1,06 G10 1,64 1,26 0,38 1,37 1,30 0,07 1,50 1,28 0,22 G11 3,44 1,10 2,34 4,65 1,97 2,68 4,05 1,54 2,51 G12 0,80 1,11 -0,31 0,56 0,68 -0,12 0,68 0,90 -0,22 G13 3,13 1,56 1,57 2,39 0,90 1,49 2,76 1,23 1,53 G14 2,64 1,32 1,32 2,40 1,10 1,30 2,52 1,21 1,31 BTS 4,94 1,18 3,76 5,24 1,33 3,91 5,09 1,26 3,83

81 Reconciliation performance is thus clearly biased by series

that do not need adjustments. In order to get a better picture of the actual effect of using information from the different levels we can try to improve reconciliation by iteratively coercing the method to set problematic series to 0. Appendix M further discusses the approach where we iterate to increase performance. This introduces a bias to the forecast but is justifiable given the underlying data behaviour. Table 5.5 presents the results which are much more in line with what we would expect, overall there is a decline compared to the best forecast but the current method is generally outperformed. Figure 5.9 shows the effect on the example presented in Figure 5.8 for 2017. The most notable changes in the forecast is its scale and adherence to the positivity constraint.

The errors introduced on the lower levels imply that reducing the number of demand characteristics, and thus the number of groups, might provide higher accuracy by avoiding part of the outlier series. This assumption was tested in Appendix N where we find that reconciliation performance is slightly worse when a demand characteristic is removed. Thus we can conclude that forecasts of lower demand groups suffer in quality and do not benefit from reconciliation but they still add value by increasing reconciliation accuracy for higher level demand groups.

Figure 5.9 Iterated reconciliation forecast

With the accuracy results at an acceptable level we illustrate the effects of reconciliation with the same examples as in Section 5.1.2. Figure 5.10 and Figure 5.11 show the forecast for total and 777 demand respectively. No large changes to the forecast are apparent. Figure 5.12 present the lower scale forecasts for KLMVOH787ROUMO demand, in line with the precious observations the changes are more apparent on lower levels of demand where they have bigger impact on accuracy as well.

Group Best Mase Rec mase Current

Total 1,45 1,44 1,56 G1 0,96 1,12 1,03 G2 0,78 0,85 1,10 G3 0,53 0,50 0,61 G4 1,05 0,92 1,78 G5 0,77 1,01 1,05 G6 0,46 0,46 0,54 G7 0,80 1,20 1,08 G8 1,12 0,88 1,81 G9 0,94 1,16 1,43 G10 0,91 1,20 1,30 G11 1,64 2,47 1,97 G12 0,50 0,50 0,68 G13 0,61 0,96 0,90 G14 0,68 0,98 1,10 BTS 0,88 1,81 1,33

Figure 5.10 Total demand forecasts reconciled vs best and current method

Figure 5.11 777 demand forecasts reconciled vs best and current method

83 From the results we conclude that reconciliation is useful when applied diligently but it might have more potential applied to continuous demand with fewer 0 observations. We find that some forecasts, mainly the higher levels, are improved while others, especially the lower groups, deteriorate. Nevertheless, the coherency introduced by reconciliation makes it a useful tool for making aligned business decisions, be it budgeting for the coming year or planning for the coming months. Finally, the current method is generally outperformed which further merits the use of more complex methods.

5.1.4 Conclusion

For each of the comparisons we are able to come to clear conclusion about the effectiveness of forecasting with multiple models. In each comparison the best results were always provided by the proposed approach. This makes sense as our forecast is able to choose the best method from a large selection of forecast combinations allowing for a better fit. Comparing the results to the current method we find that on average our approach reduces the MASE with 25% or 20%, respectively when regarding the years separately or averaged.

A similar conclusion holds when assessing the comparison against the benchmark combination. We chose four models and compared their average forecast against the best results and the current method. The benchmark performed poorly at a level comparable to that of the current approach. The fact that a combination of four different models is outperformed by the current approach confirms that different data needs different models to produce accurate results. By choosing a fixed combination, we force all four models to participate and supply information even though they might be poorly matched to the data.

Reconciliation proved to be a complicated affair to produce accurate results. Its capability of taking information from all levels and project it throughout the demand structure is beneficial in a theoretical sense. However, the abundant 0 observations at the lower demand levels made the approach less effective. By iterating the method we were able to improve the results to a point where the current method is generally outperformed. What we can safely conclude is that the current approach to forecasting demand is insufficient for capturing the underlying patterns. The best forecasts drastically improve accuracy over the groups and if reconciled we can still achieve an increase over current accuracy while aligning the forecasts.

In document Hierarchical forecasting of engineering demand at KLM Engineering & Maintenance (Page 91-95)