2. Literature Review
2.2 Immediate discipline and analytical models
2.2.5 Model-selection & Model-averaging: Optimal-calibration & Runtime-weighted
This section discusses model-selection and model-averaging to support the development of the ‘runtime-weighted model-averaging’ and ‘optimal-calibration model-averaging’ methods, which are needed because the existing model-selection criteria are unsuitable for creating model weights for the AIE model. The discussion examines why the existing criteria are unsuitable, leading to the ‘runtime-weighted model-averaging’ formula in equation (2–16) and the ‘optimal-calibration model-averaging’ method in section 2.2.5.2. These methods are developed further in section 3.5. To that end, this section discusses the link between the Bayesian Information Criterion (BIC), a model-selection method, and Universal Intelligence (Legg & Hutter 2007, p. 23), a model-averaging method. They are linked because both methods seek to balance a model’s goodness of fit with its complexity by rewarding the former and penalising the latter. Components from the BIC and Universal Intelligence are used to develop the weighting formula in equation (2–16).
The structure of the section is as follows. Section one discusses model-selection and the problems with applying the BIC to form weights for the AIE model. Section two introduces the ‘runtime-weighted model-averaging’ and ‘optimal-calibration model-averaging’ methods.
2.2.5.1 Model-selection
The BIC is used in model-selection, where the model with the lowest BIC is the preferred model. The thesis uses Green’s (2003, p. 160) version of the BIC shown in equation (2–11). The BIC is also known as the Schwarz information criterion (SIC) after its originator Schwarz (1978).
$$\mathrm{BIC}(k) = \log \sigma^{2} + \frac{k \log n}{n}$$ (2–11)
Where
k = the number of parameters in the model
n = sample size
σ² = model variance
Note that the form of equation (2–11) differs from the original version. Equation (2–11) embodies a trade-off between goodness of fit and parsimonious specification. A decrease in model variance, which indicates a higher goodness of fit, decreases the BIC. One way to increase goodness of fit is to increase the number of parameters; however, increasing the number of parameters increases the BIC through the penalty term. This demonstrates the trade-off between goodness of fit and parsimony, a search for an explanation in the simplest possible terms. The BIC is a method to implement Occam’s razor, which states ‘entities should not be multiplied beyond necessity – or – keep the simplest theory consistent with the observations.’ The other main model-selection criterion is the Akaike Information Criterion (AIC) (Akaike 1974). This thesis uses the BIC in preference to the AIC because Schwarz (1978) proves that his information criterion optimally penalises models for complexity.
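The trade-off in equation (2–11) can be illustrated with a minimal sketch. The function name `bic` and the numerical values below are hypothetical and chosen only for illustration; they are not results from the AIE model.

```python
import math


def bic(sigma2: float, k: int, n: int) -> float:
    """BIC in the form of equation (2-11): log(sigma^2) + k * log(n) / n."""
    if sigma2 <= 0 or n <= 0:
        raise ValueError("sigma2 and n must be positive")
    return math.log(sigma2) + k * math.log(n) / n


# Hypothetical comparison: the richer model fits slightly better
# (lower variance) but uses three times as many parameters.
simple_model = bic(sigma2=1.00, k=2, n=100)
richer_model = bic(sigma2=0.98, k=6, n=100)

# The model with the lower BIC is preferred.
preferred = "simple" if simple_model < richer_model else "richer"
print(preferred, round(simple_model, 4), round(richer_model, 4))
```

In this illustration the small gain in fit does not offset the larger parameter penalty, so the simpler model is preferred.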
However, there are two reasons why the BIC is inappropriate for selecting among the AIE models. Firstly, each network structure in the AIE model has multiple equilibria. Secondly, the definition of complexity of the BIC is inapplicable to the network topologies of the AIE model.
The first reason for BIC unsuitability is the multiple equilibria in the AIE model. A strict application of the BIC would ignore the multiple equilibria and select the global minimum, even though all equilibria are plausible solutions. However, it was considered impractical to determine the multiple equilibria in the AIE model, given the computational time required and the limitations of the available mathematical techniques. Section 5.6.7, on further research, discusses multiple equilibria and phase changes in more detail.
The second reason for BIC unsuitability is that the BIC calculates complexity as a function of the number of variables in the model. The AIE model has a fixed number of variables, yet its complexity varies greatly with the values of two of them: ‘the probability of a link being rewired’ and ‘the number of links in a network’. An approximation could be made by equating ‘the number of links in a network’ to the level of complexity and using it in a modified BIC. However, the level of complexity is two-dimensional, with ‘the probability of a link being rewired’ as the other dimension, and so is not easily ranked.
The thesis uses model-averaging to solve the ranking problem because each network topology in AIE has a unique structure and is a model in its own right, and is therefore amenable to model-averaging. The network topology in AIE is determined by the following three variables: ‘the number of firms’, ‘the probability of a link being rewired’ and ‘the number of links in a network’. The ‘number of firms’ is fixed at 200, hence the latter two variables determine the network topology. Section 2.2.2 discusses the 121 structures used in AIE as the product of the 11 settings for ‘the number of links in the network’ and the 11 settings for the ‘probability of a link being rewired’. The thesis uses the term
The network-averaging uses ‘equal-weighted model-averaging’ in the first instance to solve the ranking problem, which also improves predictive performance. Furthermore, the thesis develops two model-averaging techniques that address the ranking problem and are tested for their ability to improve predictive performance, ‘runtime-weighted model-averaging’ and ‘optimal-calibration model-averaging’. ‘Runtime-weighted model-averaging’ addresses the ranking problem directly by creating an alternative measure of complexity that builds on Hutter (2005) and Legg and Hutter (2007). In comparison, ‘optimal-calibration model-averaging’ finds an indirect solution to the ranking problem. Section 2.2.5.2 further discusses the model-averaging solutions to the ranking problem.
2.2.5.2 Model-averaging
Bates and Granger (1969) introduce ‘model-averaging’ to improve forecasting accuracy. Clemen (1989) reviews the literature on combining forecasts and concludes that (1) forecast accuracy is substantially improved by combining multiple individual forecasts, and (2) simple combinations of models often work reasonably well compared to more complex methods. His review discusses combining differing models to improve forecast accuracy, that is, ‘model-averaging’ (Bates & Granger 1969). Model-averaging has an extensive literature; see Garratt et al. (2007), Fernandez, Ley and Steel (2001), O'Hagan (1995) and Garratt et al. (2003).
The structure of this section is as follows. Section one discusses the development of ‘runtime-weighted model-averaging’, which builds on Hutter (2005) and the ‘Universal Intelligence’ of Legg and Hutter (2007), and section two discusses the development of ‘optimal-calibration model-averaging’.
1. Runtime-weighted Model-averaging
Hutter (2005) introduces ‘Universal Artificial Intelligence’ and Legg and Hutter (2007) introduce ‘Universal Intelligence’. They provide a model-weighting framework that is able to accommodate any combination of environment and agent. However, this framework is incomputable in practice, which requires that suitable proxies for the framework’s components be developed.
Hutter (2005, p. 30) discusses the weighting method as combining Epicurus’ principle of multiple explanations and Occam’s razor. Epicurus’ principle of multiple explanations is, ‘if more than one theory is consistent with the observations, keep all the theories.’ Equation (2–12) defines the universal intelligence of an agent π, which combines both Occam’s razor and Epicurus’ principle (Legg & Hutter 2007, p. 23).
$$\Upsilon(\pi) := \sum_{\mu \in E} 2^{-K(\mu)} V_{\mu}^{\pi}$$ (2–12)
Where
π = an agent
μ = an environment
E = a wide range of environments that have well defined rewards
K = Kolmogorov complexity function
V = value function
Legg and Hutter (2007, p. 24) state that the ability of the agent π to achieve in environment μ is represented by $V_{\mu}^{\pi}$. This ability of the agent would correspond to some inverse function of the model variance of the AIE model. The environments E in the ‘universal intelligence’ framework would correspond to the 121 network topologies in the AIE model. They use the term $2^{-K(\mu)}$ to represent Occam’s razor. This term weights the agent's performance in each environment inversely proportional to its complexity.
The Kolmogorov function represents any environment by the shortest non–repeating binary string. This function is not computable and therefore requires a proxy. Levin’s (1973, p. 266) Kt complexity provides such a proxy, which considers that the complexity of an algorithm is determined by both its minimal description length and its running time. Levin complexity makes the assumption that Universal Turing machines are able to simulate each other in linear time to retain invariance with Kolmogorov complexity (Legg & Hutter 2007, pp. 36, 9). The time t for each network structure of the AIE model to run becomes a proxy for complexity. Each of the 121 network topologies has a different running time; generally, the more links L in the network, the longer the running time and, intuitively, the more complex the model. The probability of a link being rewired ρ has the general effect of making the running time longer, which again is intuitively more complex.
Equation (2–13) shows the complexity component of the BIC formula in equation (2–11) replaced with Levin’s complexity Kt, where t is the model runtime and K is the ‘runtime-weighted constant’, denoted by c and determined experimentally. The ‘runtime-weighted constant’ will vary according to the speed of the computer running the AIE model, but using the same computer to measure the runtime for all versions of the AIE model avoids this problem. Alternatively, each computer could be benchmarked by using the runtime of a standard AIE model, which becomes the unit-time for that computer. This allows for a quasi-universal ‘runtime-weighted constant’ c after normalising the times.
$$\mathrm{BIC}^{*} = \log \sigma^{2} + \frac{ct \log n}{n}$$ (2–13)
Where
* denotes a modification to the representation of complexity, which uses Levin’s complexity Kt, denoted ct, to replace the BIC complexity measure k in equation (2–11)
t = the time for the model to run
c = runtime-weighted constant determined experimentally
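As a minimal sketch of equation (2–13), the modified criterion and the benchmark normalisation of runtimes described above might be computed as follows. The function names, the value of c and the runtimes are hypothetical; the thesis determines c experimentally.

```python
import math


def normalised_runtime(runtime_seconds: float, benchmark_seconds: float) -> float:
    """Express a runtime in units of a standard AIE model's runtime
    measured on the same computer (the 'unit-time')."""
    if benchmark_seconds <= 0:
        raise ValueError("benchmark_seconds must be positive")
    return runtime_seconds / benchmark_seconds


def bic_star(sigma2: float, t: float, c: float, n: int) -> float:
    """Equation (2-13): log(sigma^2) + c * t * log(n) / n,
    where c * t replaces the parameter count k of equation (2-11)."""
    if sigma2 <= 0 or n <= 0:
        raise ValueError("sigma2 and n must be positive")
    return math.log(sigma2) + c * t * math.log(n) / n


# Hypothetical inputs: two topologies with equal variance, one of which
# takes twice as long to run as the benchmark model.
t_fast = normalised_runtime(3.0, benchmark_seconds=3.0)
t_slow = normalised_runtime(6.0, benchmark_seconds=3.0)
print(bic_star(0.5, t_fast, c=0.1, n=50), bic_star(0.5, t_slow, c=0.1, n=50))
```

With equal variance, the slower topology receives the larger BIC* and is therefore penalised as the more complex model.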
To address the 121 network topologies using model-averaging, Kass and Raftery (1995, p. 773) note that Bayes-factors may be converted to weights for the various models to make composite estimates. Equation (2–14) shows Kass and Raftery’s (1995, p. 791) observation that the BIC gives a rough approximation to the logarithm of the ‘Bayes-factor’ (K), which is easy to use and does not require evaluation of the prior distribution.
$$\log K \approx -\frac{n}{2}\,\mathrm{BIC}$$ (2–14)
Where ≈ denotes approximately
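From equation (2–13) and equation (2–14):
$$\log K^{*} \approx -\frac{n}{2}\left(\log \sigma^{2} + \frac{ct \log n}{n}\right)$$
$$\log K^{*} \approx -\frac{n}{2}\log \sigma^{2} - \frac{1}{2}\,ct \log n$$
$$K^{*} \approx \sigma^{-n}\, n^{-ct/2}$$ (2–15)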
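The algebra leading from equations (2–13) and (2–14) to equation (2–15) can be checked numerically. The sketch below compares the logarithm of the closed form in equation (2–15) with –(n/2) BIC*; the function names and input values are hypothetical.

```python
import math


def log_k_from_bic_star(sigma2: float, t: float, c: float, n: int) -> float:
    """log K* obtained by substituting equation (2-13) into equation (2-14)."""
    bic_star = math.log(sigma2) + c * t * math.log(n) / n
    return -(n / 2) * bic_star


def log_k_closed_form(sigma2: float, t: float, c: float, n: int) -> float:
    """Logarithm of equation (2-15): log(sigma^-n * n^(-c*t/2))."""
    sigma = math.sqrt(sigma2)
    return -n * math.log(sigma) - (c * t / 2) * math.log(n)


# Hypothetical inputs used only to confirm that the two forms agree.
lhs = log_k_from_bic_star(sigma2=0.8, t=1.7, c=0.25, n=120)
rhs = log_k_closed_form(sigma2=0.8, t=1.7, c=0.25, n=120)
print(math.isclose(lhs, rhs, rel_tol=1e-12))
```

Working with log K* rather than K* itself is a design choice of this sketch: the factor σ^(–n) can overflow or underflow for large n.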
Does equation (2–15) make sense? Equation (2–15) conforms to the three observations about equation (2–12). The first observation is a fit measure inversely proportional to some function of the model variance. The second observation is a complexity penalty. The third observation is that the weight is a product of the fit measure and the complexity penalty measure. Equation (2–15) has the following two additional benefits related to n, the number of observations. The first benefit is that between two models with the same variance, the model that has a larger number of observations has a higher weight. This observation makes sense because we can be more confident that the model with the larger number of observations has a more accurately determined model variance, and therefore more confident in the assessment of how well the model fits the data. The second benefit is that between two models with the same runtime or complexity, the model with the larger number of observations is more heavily weighted. This observation also makes sense because it rewards a model for fitting more data points. However, a drawback to equation (2–15) is the requirement to determine the ‘runtime-weighted constant’ c experimentally. Section 3.5.1 discusses the method to find the ‘runtime-weighted constant’ c.
Equation (2–16) shows the ‘Bayes-factor’ from equation (2–15) used to form a weight for each model.
$$w_{m} = \frac{\sigma_{m}^{-n}\, n^{-c t_{m}/2}}{\sum_{i=1}^{M} \sigma_{i}^{-n}\, n^{-c t_{i}/2}}$$ (2–16)
Where
wm = weight for each model m
M = the number of models
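The weighting in equation (2–16) can be sketched as follows. The sketch assumes that all models share the same number of observations n, and it normalises in log space to avoid numerical overflow; the function name, the value of c and the inputs are hypothetical rather than taken from the thesis.

```python
import math
from typing import List, Sequence


def runtime_weights(sigma2s: Sequence[float], runtimes: Sequence[float],
                    c: float, n: int) -> List[float]:
    """Weights of equation (2-16) from model variances and runtimes."""
    if len(sigma2s) != len(runtimes) or not sigma2s:
        raise ValueError("need one runtime per model variance")
    # log of the numerator sigma_m^-n * n^(-c * t_m / 2) for each model
    log_k = [-(n / 2) * math.log(s2) - (c * t / 2) * math.log(n)
             for s2, t in zip(sigma2s, runtimes)]
    # subtract the maximum before exponentiating; the common factor
    # cancels in the ratio, so the weights are unchanged
    peak = max(log_k)
    unnormalised = [math.exp(v - peak) for v in log_k]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Hypothetical example with three models: models 1 and 2 share a variance
# but differ in runtime; model 3 has a higher variance.
weights = runtime_weights([0.5, 0.5, 0.6], [1.0, 2.0, 1.0], c=0.1, n=50)
print([round(w, 4) for w in weights])
```

In this illustration the weights sum to one, the longer-running of two equally well-fitting models receives the smaller weight, and the model with the higher variance receives the smallest weight.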
The derivation of the weight in equation (2–16) assumes that theorem 2 of Levin’s (1973, p. 266) complexity is Kt when it is in fact Kt + c. However, equation (2–16) can be derived from either form of Levin’s complexity and the derivation from the simpler form aids clarity.
Fernandez, Ley and Steel (2001, p. 387) note that Bayes-factors are known to be rather sensitive to the choice of the prior distribution for the parameters within each model. The AIE model has no such problem because the parameters are exact values chosen for each simulation. The priors are point mass at the chosen values of the parameters. Thus, there is little need to consider complex priors in model-averaging.
2. Optimal-calibration Model-averaging
This thesis introduces ‘optimal-calibration model-averaging’ as an alternative approach to ‘runtime-weighted model-averaging’. As discussed, ‘runtime-weighted model-averaging’ directly addresses the inadequacy of the BIC to select among models with varying degrees of complexity, but with a fixed number of variables, by using the ‘runtime’ as a proxy for complexity. In contrast, ‘optimal-calibration model-averaging’ avoids the complexity issue by ranking the 121 models in order of model variance and then, using equal weights, model-averaging the first two models, the first three models, the first four models and so on, until all 120 model-averaged combinations are calculated. The combination with the lowest model variance determines the optimal number of models to average. The predictions from this optimal number of models are averaged to form the optimal-calibration prediction. Section 3.4.2 further discusses the method to find the optimal number of models. The ‘runtime-weighted model-averaging’ and ‘optimal-calibration model-averaging’ techniques are benchmarked against the ‘equal-weighted model-averaging’ and ‘Bayes-factor model-averaging’ techniques.
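The ranking-and-averaging procedure described above can be sketched as follows. The sketch assumes that ‘model variance’ is measured as the mean squared error of a model's predictions against observed data; that measure, the function names and the toy data are assumptions made for illustration and are not taken from the thesis.

```python
from typing import List, Sequence, Tuple


def mse(pred: Sequence[float], obs: Sequence[float]) -> float:
    """Mean squared error, used here as a stand-in for model variance."""
    return sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)


def optimal_calibration(predictions: Sequence[Sequence[float]],
                        observed: Sequence[float]) -> Tuple[int, List[float], float]:
    """Rank models by variance, equal-weight average the best 2, 3, ..., M
    models, and return the count, prediction and variance of the best average."""
    if len(predictions) < 2:
        raise ValueError("need at least two models to average")
    ranked = sorted(predictions, key=lambda p: mse(p, observed))
    best_m, best_avg, best_var = 0, [], float("inf")
    for m in range(2, len(ranked) + 1):
        # equal-weighted average of the m lowest-variance models
        avg = [sum(col) / m for col in zip(*ranked[:m])]
        var = mse(avg, observed)
        if var < best_var:
            best_m, best_avg, best_var = m, avg, var
    return best_m, best_avg, best_var


# Toy data: two models err in opposite directions, a third is far off.
observed = [1.0, 2.0, 3.0]
models = [[1.1, 2.1, 3.1], [0.9, 1.9, 2.9], [3.0, 4.0, 5.0]]
best_m, best_avg, best_var = optimal_calibration(models, observed)
print(best_m, best_var)
```

In the toy data the errors of the two best models offset each other, so averaging two models gives the lowest variance and adding the third model worsens it.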
The literature supporting the component parts of the AIE model is now in place ready to discuss the arising research questions.