EVALUATION OF FORECASTING METHODS
13. Evaluating Methods
The principles for evaluating forecasting methods are based on generally accepted scientific procedures.
13.1 Compare reasonable methods.
Description: Use at least two methods, preferably including the current procedure as one of these. Exclude methods that unbiased experts would consider unsuitable for the situation.
Purposes: To select the best method and improve methods.
Conditions: Whenever biases can affect the evaluation (which is often). Knowledge of alternative approaches is helpful.
Strength of evidence: Some empirical support.
Source of evidence: Armstrong, Brodie and Parsons (2001) provide evidence showing that this principle reduces a researcher’s biases.
13.2 Use objective tests of assumptions.
Description: Use quantitative approaches to test assumptions.
Purposes: To select the best method and to improve methods.
Conditions: Tests are relevant for important assumptions, assuming that you can obtain objective assessments.
This is relevant only for cases where you are uncertain about the validity of the assumptions.
Strength of evidence: Received wisdom.
Source of evidence: None.
13.3 Design test situations to match the forecasting problem.
Description: Test forecasting methods by simulating their use in making actual forecasts. For example, to assess how accurate a model is for five-year-ahead forecasts, test it for five-year-ahead out-of-sample (ex ante) forecasts.
(This is related to Principle 6.7.)
Purposes: To select the best method and improve methods.
Conditions: You need knowledge of alternative approaches and the situation.
Strength of evidence: Some empirical support.
Source of evidence: Armstrong (2001c) summarizes evidence, much of which comes from studies of personnel selection.
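A minimal sketch of Principle 13.3, assuming annual data and a simple no-change method as a stand-in for whatever method is being tested. The series values and the names naive_forecast and horizon_errors are hypothetical and are not taken from the source. At each forecast origin the method sees only the data available up to that origin, and its error is measured at exactly the horizon of interest.

```python
def naive_forecast(history, horizon):
    """No-change forecast: repeat the last observed value (a stand-in method)."""
    return history[-1]


def horizon_errors(series, horizon, min_history, forecast_fn=naive_forecast):
    """Absolute ex ante errors at one fixed horizon over successive origins."""
    errors = []
    for origin in range(min_history, len(series) - horizon + 1):
        history = series[:origin]                 # data known at the origin only
        actual = series[origin + horizon - 1]     # value `horizon` steps ahead
        errors.append(abs(forecast_fn(history, horizon) - actual))
    return errors


# Hypothetical annual series (illustrative numbers only).
annual = [100, 104, 109, 113, 118, 124, 129, 135, 141, 148]

# Test the method at the horizon it will actually be used for: five years ahead.
five_year = horizon_errors(annual, horizon=5, min_history=3)
print(five_year)
```

To evaluate a different method, pass another function with the same (history, horizon) signature as forecast_fn.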
13.4 Describe conditions associated with the forecasting problem.
Description: Ideally, these conditions will be similar to those in other situations, allowing for a comparison of the present situation with others.
Purpose: To apply appropriate methods for new situations.
Conditions: Whenever you need to generalize to new situations.
Strength of evidence: Common sense.
Source of evidence: None.
13.5 Tailor the analysis to the decision.
Description: Often the proper analytic procedure will be obvious, but not always.
Purpose: To ensure proper use of forecasts.
Conditions: This is relevant when it is not immediately obvious how to compare forecasting methods. Armstrong (2001c) describes situations in which it is not clear how to analyze the information.
Strength of evidence: Common sense.
Source of evidence: None.
13.6 Describe potential biases of forecasters.
Description: Describe biases that might affect forecasters or their methods.
Purpose: To select the best method and to improve methods.
Conditions: Adjust for biases especially when the forecasting process relies on judgment.
Strength of evidence: Strong empirical support.
Source of evidence: Armstrong (2001c) summarizes evidence from studies of government revenue forecasts, political polls, and government deregulation.
13.7 Assess the reliability and validity of the data.
Description: Provide quantitative assessments of validity and reliability.
Purpose: To improve forecast accuracy.
Conditions: It is important to assess data quality when forecasting the effects of alternative policies. Armstrong (2001c) discusses a study that concluded that increases in the minimum wage would help unskilled workers.
However, the study had serious problems with the reliability of the data and a reanalysis with corrected data reversed the findings.
Strength of evidence: Received wisdom.
Source of evidence: None.
13.8 Provide easy access to the data.
Description: If the data are easily available, replications can be done. Given the evidence on the difficulty of replicating findings in management science (Armstrong 2001c), the principle is important.
Purpose: To reliably assess the accuracy of alternative methods.
Conditions: Full access to data is particularly important when forecasts might be affected by biases. Sometimes, reanalysis of data yields different results. Websites can now make full disclosure of data inexpensive. For example, data from the M-Competitions are available on the Forecasting Principles website.
Strength of evidence: Common sense.
Source of evidence: None.
13.9 Provide full disclosure of methods.
Description: Detailed descriptions of the methods can allow others to audit forecasting methods and to replicate them. Whereas full disclosure used to be expensive due to limited space in journals, it can now be accomplished by putting methodological details on websites.
Purposes: To select the best method and improve methods.
Conditions: Full disclosure is most important when the methods require judgmental inputs or when the methods are new to the situation.
Strength of evidence: Received wisdom.
Source of evidence: Armstrong (2001c) provides evidence on the value of this principle.
13.10 Test assumptions for validity.
Description: Provide quantitative assessments of the validity of the assumption. This includes face, construct, and predictive validity.
Purpose: To assess the accuracy of forecasts.
Conditions: This is important when comparing the effects of proposed alternative policies.
Strength of evidence: Common sense.
Source of evidence: None.
13.11 Test the client's understanding of the methods.
Description: A method that is easy to understand might be preferable even if it reduces accuracy. In practice, the clients often do not understand the methods. This principle is related to Principle 1.5 (obtain agreement on methods).
Purposes: To select the most appropriate forecasting method and to increase the likelihood that it will be used properly. For example, the client should be able to identify when the methods need to be revised.
Conditions: It is important to understand the methods if key aspects of the problem are likely to change.
Strength of evidence: Received wisdom.
Source of evidence: None.
13.12 Use direct replications of evaluations to identify mistakes.
Description: By redoing evaluations, one can check for mistakes. Researchers have replicated the M-Competition studies and have identified some mistakes. (However, these mistakes did not alter the conclusions.)
Purpose: To check for mistakes in comparisons of methods.
Conditions: Replication is especially useful for complex methods and when forecasts might be affected by biases.
Strength of evidence: Weak evidence.
Source of evidence: Armstrong (2001c) reviews four studies showing that mistakes occur often in forecasting.
13.13 Replicate forecast evaluations to assess their reliability.
Description: Replications provide the best way to assess reliability. However, replications are seldom used in management science (Hubbard & Vetter 1996).
Purpose: To obtain reliable comparisons of alternative forecasting methods.
Conditions: Replication is especially important when the data are likely to be unreliable, biases are likely, and when forecast errors can have serious consequences.
Strength of evidence: Received wisdom.
Source of evidence: None.
13.14 Use extensions of evaluations to better generalize about what methods are best for what situations.
Description: This involves replications that contain variations in important elements of the situation or method.
Purpose: To ensure use of the proper forecasting methods.
Conditions: Extensions are important when you expect to use the forecasting procedure for a wide range of problems.
Strength of evidence: Some indirect empirical support.
Source of evidence: Hubbard and Vetter (1996), in their review of published extensions in accounting, economics, finance, management, and marketing, found that 46 percent of the findings differed from those in the original study.
13.15 Conduct extensions of evaluations in realistic situations.
Description: When evaluating alternative forecasting methods, do so in situations that provide realistic representations of the forecasting problem.
Purpose: To ensure use of the proper forecasting methods.
Conditions: This is important when a situation involves large changes and when forecast errors have serious consequences.
Strength of evidence: Received wisdom.
Source of evidence: None.
13.16 Compare forecasts generated by different methods.
Description: Comparisons of forecasts from different methods can be used to examine forecast accuracy and to assess uncertainty. Armstrong (2001c) discusses this issue.
Purpose: To ensure use of the proper forecasting methods.
Conditions: This principle applies when the situation permits the use of multiple methods. It is especially useful when methods differ substantially.
Strength of evidence: Received wisdom.
Source of evidence: None.
13.17 Examine all important criteria.
Description: Yokum and Armstrong (1995) describe various criteria along with ratings of their importance by decision makers, practitioners, educators, and researchers.
Purposes: To improve acceptance of the proposed methods and to ensure that they meet the needs of the decision makers.
Conditions: Good knowledge of the problem is needed in order to evaluate all important criteria (e.g., accuracy, ability to assess uncertainty, cost). The importance of criteria varies by conditions (e.g., long term vs. short term) and by methods (e.g., extrapolation vs. econometric methods). This principle is especially important when biases are likely.
Strength of evidence: Common sense.
Source of evidence: None.
13.18 Specify criteria for evaluating methods prior to analyzing data.
Description: List the criteria in order of importance before analyzing the data.
Purpose: To help in selecting proper forecasting methods.
Conditions: This is important when different methods yield substantially different forecasts, when judgmental inputs are important, or when biases may have a strong influence.
Strength of evidence: Some empirical support.
Source of evidence: Armstrong (2001c) summarizes evidence on the need to prespecify criteria.
13.19 Assess face validity.
Description: Face validity involves asking whether the evaluation study makes sense to independent unbiased experts. Assess face validity in a structured way (e.g., by using questionnaires) to obtain expert opinions.
Purpose: To ensure the use of the proper methods and to gain acceptance of the forecasts.
Conditions: Face validity is important when large changes are expected.
Strength of evidence: Received wisdom.
Source of evidence: None.
13.20 Use error measures that adjust for scale in the data.
Description: Ensure that the comparisons among methods are not distorted by one series having larger numbers than other series and thus being weighted more heavily. One can use error measures that are expressed as percentages to adjust for scale.
Purpose: To help ensure the use of the proper forecasting methods.
Conditions: When comparing across different situations (e.g., across different time series), you need error measures that are not unduly influenced by a small number of series. This is important when dealing with heterogeneous time series.
Strength of evidence: Received wisdom.
Source of evidence: None.
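To illustrate Principle 13.20, the sketch below pools errors from two hypothetical series whose values differ by a factor of 1,000. All data and function names are assumptions made for illustration. The pooled mean absolute error is driven almost entirely by the large series, whereas the percentage errors weight both series on a comparable footing.

```python
from statistics import mean


def absolute_errors(actuals, forecasts):
    return [abs(f - a) for a, f in zip(actuals, forecasts)]


def absolute_percentage_errors(actuals, forecasts):
    # Assumes every actual value is non-zero.
    return [100.0 * abs(f - a) / abs(a) for a, f in zip(actuals, forecasts)]


# Hypothetical (actuals, forecasts) pairs; the forecasts for the small series
# are 20% too high and those for the large series are 5% too high.
series = {
    "small": ([10, 20, 40], [12, 24, 48]),
    "large": ([10000, 20000, 40000], [10500, 21000, 42000]),
}

pooled_abs = [e for a, f in series.values() for e in absolute_errors(a, f)]
pooled_pct = [e for a, f in series.values()
              for e in absolute_percentage_errors(a, f)]

pooled_mae = mean(pooled_abs)    # dominated by the large series
pooled_mape = mean(pooled_pct)   # each series contributes on the same scale

print(round(pooled_mae, 1), pooled_mape)
```

On these numbers the unscaled measure suggests the large series is forecast worst, while the percentage measure shows that the small series has the larger relative errors.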
13.21 Ensure error measures are valid.
Description: Error measures should relate to the decision being made, such as to determine which is the most accurate method.
Purpose: To help ensure use of the proper forecasting methods.
Conditions: In general, evaluation studies should be concerned with the validity of the error measures.
Strength of evidence: Common sense.
Source of evidence: None.
13.22 Use error measures that are not sensitive to the degree of difficulty in forecasting.
Description: This principle prevents the evaluation from being dominated by a few series that have very large forecast errors. Apply this principle when some series are subject to large changes. Ohlin and Duncan (1949) identified the need for this principle. Relative absolute errors (RAE) compensate somewhat for differences in the difficulty of forecasting series (Armstrong & Collopy 1992).
Purpose: To properly assess the relative accuracy of different methods.
Conditions: This principle applies only when generalizing across time series that vary in their forecasting difficulty.
Strength of evidence: Common sense.
Source of evidence: None.
13.23 Avoid biased error measures.
Description: Do not use an error measure favoring forecasts that are systematically high (or low). Armstrong (2001c) describes this issue and how to resolve it.
Purpose: To properly assess relative accuracy.
Conditions: This applies when one needs to assess forecasts that cover a wide range of values and is especially relevant for non-negative time series.
Strength of evidence: Common sense.
Source of evidence: None.
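One way to see the kind of bias Principle 13.23 warns about: for a non-negative series, the absolute percentage error (APE) of a forecast that is too low can never exceed 100 percent, while the APE of a forecast that is too high is unbounded. The sketch below compares the APE with a symmetric variant that divides by the average of the actual and the forecast. The symmetric variant is shown as one commonly used option and is an assumption here, not necessarily the resolution Armstrong (2001c) describes. The numbers are hypothetical.

```python
def ape(actual, forecast):
    """Absolute percentage error, relative to the actual value."""
    return 100.0 * abs(forecast - actual) / abs(actual)


def symmetric_ape(actual, forecast):
    """APE relative to the average of actual and forecast (assumed variant)."""
    return 100.0 * abs(forecast - actual) / ((actual + forecast) / 2.0)


actual = 100.0
too_low = 50.0     # off by a factor of two on the low side
too_high = 200.0   # off by a factor of two on the high side

print(ape(actual, too_low), ape(actual, too_high))
print(symmetric_ape(actual, too_low), symmetric_ape(actual, too_high))
```

Under the plain APE the high forecast is penalized twice as heavily as the low one, so a method that forecasts systematically low looks better. The symmetric variant scores the two factor-of-two misses equally.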
13.24 Avoid error measures that are highly sensitive to outliers.
Description: Armstrong (2001c) describes error measures that offer protection against the effects of outliers.
Purpose: To properly assess relative accuracy.
Conditions: This principle is only needed when outliers are likely. However, if it is the outliers that are of concern, such as hurricanes or floods, ignore this principle.
Strength of evidence: Common sense.
Source of evidence: None.
13.25 Use multiple measures of accuracy.
Description: Armstrong (2001c) describes a variety of error measures.
Purpose: To properly assess the relative accuracy of alternative forecasting methods.
Conditions: Use multiple measures when there is uncertainty about the best error measure.
Strength of evidence: Received wisdom.
Source of evidence: Armstrong (2001c) shows how evaluations of alternative forecasting methods can differ depending upon the error measure chosen.
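The following sketch shows how a ranking of two methods can depend on the error measure, which is the motivation for Principle 13.25. It computes the mean and median absolute percentage error (MAPE, MdAPE) and the median relative absolute error (MdRAE), where the RAE divides a method's absolute error by that of a no-change forecast, following the idea in Principle 13.22. The data, the two methods, and the function names are hypothetical. The sketch also assumes that no actual value is zero and that the no-change forecast is never exactly right.

```python
from statistics import mean, median


def ape(actual, forecast):
    return 100.0 * abs(forecast - actual) / abs(actual)


def rae(actual, forecast, naive):
    """Method's absolute error relative to the no-change forecast's error."""
    return abs(forecast - actual) / abs(naive - actual)


def summarize(actuals, forecasts, naive):
    apes = [ape(a, f) for a, f in zip(actuals, forecasts)]
    raes = [rae(a, f, n) for a, f, n in zip(actuals, forecasts, naive)]
    return {"MAPE": mean(apes), "MdAPE": median(apes), "MdRAE": median(raes)}


# Hypothetical data (illustrative only).
actuals = [100, 110, 120, 130, 140]
naive = [95, 100, 110, 120, 130]        # no-change (last observed) forecasts
method_a = [101, 111, 119, 131, 200]    # usually close, one large miss
method_b = [106, 116, 114, 136, 146]    # consistently moderate errors

summary_a = summarize(actuals, method_a, naive)
summary_b = summarize(actuals, method_b, naive)
for name, summary in (("A", summary_a), ("B", summary_b)):
    print(name, {k: round(v, 2) for k, v in summary.items()})
```

With these numbers the MAPE favors method B, while both median-based measures favor method A. Reporting several measures makes such disagreements visible.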
13.26 Use out-of-sample (ex ante) error measures.
Description: Conditional (ex post) errors are not closely related to ex ante errors.
Purpose: To properly assess the relative accuracy of forecasting methods.
Conditions: Ex ante error measures are especially important for time series that include moderate to large changes.
Strength of evidence: Strong empirical evidence supports this principle, which conflicts with common practice and with recommendations by statisticians.
Source of evidence: Armstrong (2001c) summarizes evidence from six studies.
13.27 Use ex post error measures to evaluate the effects of policy variables.
Description: Assuming that changes in the explanatory variables were correctly forecast, how well does the model predict the effects of policy changes?
Purpose: To determine how effectively methods can forecast the outcomes of policy changes (e.g., to examine the effects of different price levels for a product).
Conditions: Ex post tests are important when decision makers want to assess the outcomes of alternative policies, such as when using econometric models. In addition, ex post tests help improve econometric models by showing the sources of error.
Strength of evidence: Common sense.
Source of evidence: None.
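The distinction between ex ante and ex post errors in Principles 13.26 and 13.27 can be sketched with a toy model. The linear demand model, its coefficients, and all values below are hypothetical and chosen only for illustration. The ex ante forecast uses the price that was forecast beforehand. The ex post forecast uses the price that actually occurred, which isolates how well the model itself predicts the effect of price.

```python
def predict_sales(price, intercept=200.0, slope=-10.0):
    """Hypothetical linear demand model: sales = intercept + slope * price."""
    return intercept + slope * price


forecast_price = 5.0   # price expected when the forecast was made
actual_price = 6.0     # price that actually occurred
actual_sales = 135.0   # observed outcome

ex_ante_forecast = predict_sales(forecast_price)   # uses the forecast input
ex_post_forecast = predict_sales(actual_price)     # uses the actual input

ex_ante_error = abs(ex_ante_forecast - actual_sales)
ex_post_error = abs(ex_post_forecast - actual_sales)

# Part of the ex ante forecast that is due to the wrong price forecast.
input_effect = abs(ex_ante_forecast - ex_post_forecast)

print(ex_ante_error, ex_post_error, input_effect)
```

Here the ex post error is the part attributable to the model given the correct price, and the remaining gap comes from the error in forecasting the price.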
13.28 Do not use R-square (either standard or adjusted) to compare forecasting models.
Description: R-square ignores bias and it has little relationship to decision-making.
Purpose: To avoid improper evaluation of the accuracy of methods.
Conditions: R-square is a misleading measure for comparing time series models although it may have some relevance for cross-sectional data.
Strength of evidence: This principle conflicts with received wisdom, but some empirical evidence supports it.
Source of evidence: Armstrong (2001c) describes the problems associated with the use of R-square.
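A small sketch of the point that R-square ignores bias. Here R-square is taken to be the squared correlation between forecasts and actual values, which is an assumption about how it is computed. The data and function names are hypothetical. A set of forecasts that is always 50 units too high tracks the actual values perfectly and so attains an R-square of 1, even though every forecast is badly wrong.

```python
from statistics import mean


def r_squared(actuals, forecasts):
    """Squared Pearson correlation between forecasts and actuals."""
    ma, mf = mean(actuals), mean(forecasts)
    cov = sum((a - ma) * (f - mf) for a, f in zip(actuals, forecasts))
    var_a = sum((a - ma) ** 2 for a in actuals)
    var_f = sum((f - mf) ** 2 for f in forecasts)
    return cov * cov / (var_a * var_f)


def mean_absolute_error(actuals, forecasts):
    return mean(abs(f - a) for a, f in zip(actuals, forecasts))


# Hypothetical data (illustrative only).
actuals = [100, 120, 90, 130, 110]
biased = [a + 50 for a in actuals]       # always 50 units too high
unbiased = [104, 113, 96, 124, 113]      # small errors with mixed signs

for name, forecasts in (("biased", biased), ("unbiased", unbiased)):
    print(name,
          round(r_squared(actuals, forecasts), 3),
          mean_absolute_error(actuals, forecasts))
```

On these numbers R-square ranks the biased forecasts first, while the mean absolute error shows that they are roughly ten times less accurate.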
13.29 Use statistical significance only to compare the accuracy of reasonable methods.
Description: Little is learned by rejecting an unreasonable null hypothesis. When comparing accuracy, adjust the significance level for the number of models that are compared when more than two models are involved.
Purpose: To avoid improper evaluation of the accuracy of forecasting methods.
Conditions: Statistical significance can be misleading in forecasting time series because of autocorrelation or outliers. It can be useful, however, in making comparisons of reasonable methods when one has only a small sample of forecasts.
Strength of evidence: Received wisdom.
Source of evidence: Armstrong (2001c) describes studies showing the dangers of using statistical significance.
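The adjustment of the significance level for the number of models can be sketched with the Bonferroni correction, which divides the overall level by the number of comparisons. Bonferroni is used here as one conventional choice and is an assumption, since the principle does not name a specific adjustment. The function name and its pairwise option are likewise hypothetical.

```python
from math import comb


def bonferroni_alpha(alpha, n_models, pairwise=True):
    """Per-comparison significance level under a Bonferroni adjustment.

    pairwise=True  : every model is compared with every other model.
    pairwise=False : every model is compared with one benchmark model.
    """
    if n_models < 2:
        raise ValueError("need at least two models to compare")
    comparisons = comb(n_models, 2) if pairwise else n_models - 1
    return alpha / comparisons


# Four models at an overall 5% level.
print(bonferroni_alpha(0.05, 4))                  # six pairwise comparisons
print(bonferroni_alpha(0.05, 4, pairwise=False))  # three benchmark comparisons
```

With only two models there is a single comparison, so the level is unchanged.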
13.30 Do not use root mean square errors (RMSE) to make comparisons among forecasting methods.
Description: The RMSE is an unreliable measure for comparing forecasting methods.
Purpose: To avoid improper evaluation of the accuracy of methods.
Conditions: The RMSE is not needed in forecasting. More appropriate procedures exist. Using root mean squares can be especially misleading when you are dealing with heterogeneous time series.
Strength of evidence: There is strong empirical support, and it conflicts with received wisdom.
Source of evidence: Armstrong and Fildes (1995) summarize evidence on this issue.
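The sketch below illustrates how an RMSE pooled over heterogeneous series can be decided by a single series. The forecast errors, series names, and function names are hypothetical. Method A is more accurate on the two small-scale series and method B is slightly more accurate on the large-scale series, yet the pooled RMSE depends almost entirely on the large series.

```python
from math import sqrt


def rmse(errors):
    return sqrt(sum(e * e for e in errors) / len(errors))


def pooled_rmse(errors_by_series):
    """RMSE over all forecast errors from all series combined."""
    return rmse([e for errs in errors_by_series.values() for e in errs])


def series_won(errors_x, errors_y):
    """Number of series on which method x has the lower per-series RMSE."""
    return sum(rmse(errors_x[k]) < rmse(errors_y[k]) for k in errors_x)


# Hypothetical forecast errors per series (illustrative only).
errors_a = {"small_1": [1, 1, 1], "small_2": [2, 1, 2],
            "large": [900, 1100, 1000]}
errors_b = {"small_1": [4, 5, 4], "small_2": [6, 5, 6],
            "large": [850, 1000, 950]}

print(round(pooled_rmse(errors_a), 1), round(pooled_rmse(errors_b), 1))
print(series_won(errors_a, errors_b), "of", len(errors_a), "series favor A")
```

On these numbers the pooled RMSE favors method B even though method A is the more accurate method on two of the three series.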
13.31 Base comparisons of methods on large samples of forecasts.
Description: For time series, use many series, horizons, and origins. Try to obtain forecasting cases that are somewhat independent of one another. To the extent that they are not independent, use larger samples of forecasts.
Armstrong (2001c) discusses how to expand the sample of forecasts.
Purpose: To assess the accuracy of alternative forecasting methods.
Conditions: Relevant primarily for time series. It must be possible to obtain many forecasts from similar situations.
Strength of evidence: Received wisdom.
Source of evidence: Armstrong (2001c) summarizes evidence on the need for large samples.
13.32 Conduct explicit cost-benefit analyses.
Description: Examine the costs and benefits of each forecasting method.
Purpose: To select the most appropriate forecasting method.
Conditions: This is relevant when the cost of forecasting may exceed the potential benefits.
Strength of evidence: Common sense.
Source of evidence: None.