This section draws conclusions from the results of the DF technique comparison, discusses the impact of these results for practice and the scientific field, discusses the results’ limitations and provides directions for future research.
Conclusion of DFT performance comparison results
When looking at the results, it becomes clear that there is no such thing as the ultimate demand forecasting technique and that performance greatly varies across items and forecasting scenarios. The tables in the previous sections give an overview of which DF technique performs best in which scenario. In general, it can be said that ARIMA performs best for the largest percentage of items for all scenarios that only use historical data. ES performs well in terms of average RelRMSE for many scenarios that only use historical data. ARIMA’s average performance tends to be lower for the daily scenarios, which could be due to the fact that this data is less smooth and that there might be days with zero sales. The more complex scenarios, such as the scenarios with a daily time detail level, are also where the more advanced forecasting techniques such as neural networks show their best performance.
By far the best performance overall can be obtained by automatically selecting the best forecasting technique for each individual item(-store combination). The impact of automatic DF technique selection is greatest for forecasting scenarios with a daily time detail level. It results in an average RelRMSE improvement that ranges from 6.4% (OWC scenario) to 32.4% (M2DC scenario) compared to always using NAIVE. The average RelRMSE improvement ranges from 2.1% (M2WC scenario) to 10.1% (M2DC scenario) compared to always using the best individual DF technique. Additionally using external factors to produce forecasts has shown to be valu- able for forecasting scenarios that have a daily time detail level. The best perfor- mance was still obtained by automatically selecting the best DF technique for each item(-store combination). The external factors enabled an additional 8.6% average RelRMSE improvement in the ODCef scenario and 2.9% in the ODSef scenario com- pared to the original scenarios with historical data only. For scenarios with a weekly time detail level, adding external factors slightly decreased performance. Regression and neural networks performance greatly improved compared to scenarios with only historical data and for the daily scenarios some now outperformed ARIMA.
CHAPTER 4. PERFORMANCE COMPARISON RESULTS 46
Impact for retailers in practice
Grocery retailers are recommended to ensure their DF decision support system sup- ports a wide range of underlying quantitative forecasting techniques and that it is capable of automatically selecting the best performing one for each item(-store com- bination). When deciding on the quantitative techniques that should be supported, the grocery retailer can refer to the result tables in this chapter and prioritize the implementation of DF techniques using their performance results.
The automatic selection of the best DF technique for each item results in a substantial forecasting performance improvement. The exact amount of the aver- age RelRMSE improvement depends on the forecasting problem that the particular grocery retailer faces. It ranges from 6.4% in the OWC scenario to 32.4% in the M2DC scenario compared to always using the NAIVE technique. The case of Tesco gives an impression of exactly how large the absolute cost savings can be for a big grocer. Their improved demand forecasting system saved£100 million a year.
For retailers who are not yet ready to modify their DF system itself, the results from this study can be used by forecasters to improve forecasts for a selection of in- dividual items for which current forecasting performance is low. Forecasters can use the results of this study to decide on the most suitable forecasting technique. With some modifications, the evaluation algorithm used in this study can be transformed to a tool that can be used directly by forecasters. They could then upload the sales history of a certain product and receive a suggestion on which DF technique would perform best for that product.
Impact for the scientific field
To the best of our knowledge, this study is the first large-scale comparison of demand forecasting techniques in a food retail context. The results help researchers select a suitable DF technique for their forecasting problem. Before this paper, researchers would have been overwhelmed by the immense range of DF technique variations available. This paper discusses and explains both the most commonly used and most promising new DF forecasting techniques. Finally, the results show researchers which technique performs best in which forecasting scenario. In addition, it is the first study that examines the performance of LSTMs for food demand forecasting.
The results also showed that DF technique performance varies widely across items and scenarios. This has an impact on the research method that should be followed. The literature review showed that currently many researchers base their results on either a single product or a small selection of products. To improve the robustness of their results, we recommend researchers to base their results on as many products that match their research scope (e.g. perishables, dairy) as possible. The robustness of the results in this paper is high, since at least 908 different perishable products were used (with up to 4827 store-item combinations for scenarios with a store location detail level).
Potential limitations and future research directions
The results of the DF technique comparison in this study have some potential limitations. First of all, the performance comparison was conducted using a dataset from a single grocery retailer and performance results might be slightly different for another grocery retailer. In addition, some characteristics of country where this grocer’s supermarkets were located (Ecuador) might have influenced the results for
CHAPTER 4. PERFORMANCE COMPARISON RESULTS 47
the external factors evaluation. For example, the climate in Ecuador is fairly stable year-round, so the external factors from the weather category might prove to have more predictive power in another country where the climate has more pronounced seasons (such as The Netherlands). Future research could include data from other grocers in other countries to minimize the influence of grocer- and location-specific factors on the results. Another potential limitation is that only lag-1 historical data was used for the linear regression and neural networks techniques in this study, so when more lags are added their performance relative to other DF techniques will likely improve. Future research could add extra lags to the models used in this study, for example lag-7 and lag-31 data could be added to scenarios with a daily time detail level to specifically give the models information about sales in the prior week or month. Future research could also investigate what the optimal size is for the sales history training set. Currently, all data in the training set was indeed used for training. Even though neural networks require a lot of data to perform well, it is not necessarily the case that more historical data is better, because data that is very old might not be representative anymore of current demand patterns. In some cases, using less history to train a neural network could result in a better performance [65]. Due to the anonymized nature of the dataset, there were also some interesting aspects that could not yet be examined. When there is information available on the brand and product names, future research could include external factors such as search trends and social media mentions and sentiment for brands and products in the comparison. When there is information available on how product prices changed over time, this could be used as an extra factor in the demand forecasting models. Future research could also examine how well neural networks perform when they are trained on the historical sales data (and external factor data) of multiple items. When they are trained on items that are sufficiently similar (e.g. from the same item class such as ‘yoghurt’), this might further boost performance, since the neural networks then have more data to learn from.
Future research could also investigate the generalizability of the results to other types of retailers or to other industries. Accurate demand forecasting is also im- portant for for example fashion retailers. Many of their clothing items are seasonal and have a zero or low salvage value at the end of the sales period, so they want to prevent ordering too many items. However, they also don’t want to order too few items, since long lead times from factories in Asia make it difficult to order replenishments during the season. The energy industry is an example of another in- dustry where demand forecasting is important, since accurate demand forecasts are required (at a sub-daily time detail level) to maintain the balance in the energy net- work. It is expected that the results from this study are well generalizable to other retailers or other industries facing forecasting problems with similar characteristics as the ones for grocers.
5
DF Improvement Process
The previous chapters focused primarily on demand forecasting techniques, which form the quantitative core of the demand forecasting process. To give a holistic view, this chapter provides a step-by-step process that can guide DF improvement efforts and gives an overview of other factors that can be considered during the im- provement of the wider demand forecasting process. The DF improvement process is depicted in figure 5.1 and contains the following steps:
1. Current forecasting situation assessment 2. Forecasting goals determination
3. Demand forecasting techniques selection 4. DSS implementation and adoption 5. Organizational factor alignment 6. Results evaluation
Step 1 analyses the current forecasting situation (as-is), step 2 defines the organi- zation’s forecasting improvement focus (to-be), steps 3 to 5 enable the organization to go from the as-is to the to-be state and step 6 is about evaluating results and making adjustments where necessary. When aiming for continuous improvement, another iteration of the DF improvement process can be started.
Assess current situation Determine forecasting goals Select DF techniques Implement in DSS Align organizational factors Evaluate Results
Figure 5.1: Demand forecasting improvement process
CHAPTER 5. DF IMPROVEMENT PROCESS 49
5.1
Step 1: Assess Current Situation
The goal of this assessment is to inventory what forecasts are being created in the organization, what decisions they support and what level of forecasting capabilities the organization currently has.
Existing evaluation and audit frameworks already provide useful guidelines on assessing current forecasting capabilities. Mentzer et al. [44] developed a sales forecasting management evaluation framework and benchmark. They identified four dimensions of the forecasting process, including functional integration, ap- proach, systems and performance measurement. For each of these dimensions, four stages are described in detail so companies can benchmark their current forecasting performance. Moon et al. [46] provide a literature review of several forecasting audit frameworks and extended the framework and benchmark from Mentzer et al. [44]. They also identified some strategic themes that many of the companies they interviewed were facing. One important part of the assessment is to evalu- ate the suitability of the forecasting performance measures that are used. In 2008, Fildes et al. [24] reported that based on observations in practice they had little confidence that forecasting systems and companies used the appropriate forecasting performance measures.