3.2 The Role of Model Complexity within Model Selection Criteria
3.2.1 Consistency in Model Selection
Accordingly, consistent model selection is sometimes also called confirmatory (Aho et al., 2014), i.e. confirming the identified DGP by the given data D in hindsight. Non-consistent model selection is also called conservative (Leeb and Pötscher, 2009) or exploratory (Aho et al., 2014), i.e. the model selected to approach the DGP is appropriate to conservatively predict or explore new data D0 in foresight. In the past, it was discussed whether the two types of model selection are (anti-)correlated (e.g. MacKay, 1992) or uncorrelated (e.g. Bishop, 1995) with each other. Although such behaviour might appear coincidentally, it was generally shown that no model selection method can be optimal in both respects (Hurvich and Tsai, 1989; Yang, 2005; Arlot and Celisse, 2010).
Illustrative thought experiment
The exploratory and confirmatory natures of the two model selection types can be illustrated by a simple thought experiment: Imagine two modelers A and B who seek to model a controlled laboratory experiment (e.g. a tracer flow-through column experiment). Due to the fully controlled conditions, it can be assumed that this lab-scale truth is of (relevant) finite dimensionality. Modeler A, e.g. an engineer or manager, assumes that there are too many dimensions to be covered by a fixed parametric model, but still wants to find the best model for future predictions. Accordingly, she picks a type of model which is allowed to grow with incoming new information and starts off with operational data-driven models, e.g. regression models. Modeler B, e.g. a fundamental scientist, wants to identify the true data-generating process and hence prefers parametric physics-based models.
One might think that the two purposes are the same thing, but from the perspectives of non-consistent vs. consistent model selection, they are not.
Each of them starts with three models of their preferred model type with increasing complexity: A simple first model, a more complex second model and a highly complex third model. Let’s assume that the second model of modeler B actually represents the truth (which is an idea borrowed from consistent selection), i.e. employs the right physical equations. On the same level of complexity, the second model of modeler A mimics the data best, but as a data-driven empirical model it is clear that it cannot represent the true data-generating process.
Both modelers collect and use the same data continuously in order to perform a model selection procedure as soon as a new batch of data, i.e. new and non-redundant information, comes in. According to her modelling purpose, modeler A uses a non-consistent model selection criterion targeting the highest predictive performance. Modeler B performs consistent model selection to identify the truth and to understand the underlying physics. This procedure is shown schematically in Figure 8.
[Figure 8 panels: model preference (y-axis, 0 to 1) over Models 1, 2 and 3 (x-axis), shown for Phase 0 and for Phases 1 to 3 (little data to much data), with one row for non-consistent and one row for consistent selection.]
Figure 8: Differences in model rating following non-consistent (A-type) and consistent (B-type) model selection for increasing data size. The models are rated on a normalized scale between 0 and 1. Models 1, 2 and 3 resemble increasing stages of complexity (from Höge et al., 2018).
In Phase 0, before having any data, both modelers start with uniform model choice preferences across their candidate models. In Phase 1, with little data available, no complex model can be supported, so the simple first model of each modeler is selected. However, with more incoming and informative data (Phase 2), a more complex model provides a better trade-off between fit and complexity. Hence, the second models of both modelers get selected by their respective criteria. With more and more data becoming available in Phase 3, the two rankings become fundamentally different in the large sample limit: For modeler B, the third physical model (which is more complex than the truth) will never stand a chance in a model selection process in the long run. Its additional complexity would be called excessive. However, the third data-driven model of modeler A can be justified as the model with the best trade-off between fit and complexity from a non-consistent perspective.
This is because, for modeler B, the second model revealed itself as representing the data-generating process, and as such both the simpler (1st) and the more complex (3rd) model are rejected by the consistent model selection procedure. For modeler A, it was clear from the beginning that the truth is not among the data-driven candidate models. In that case, a more complex model becomes justifiable as more observations become available. More data reduces the risk of merely fitting noise, so, from the efficiency perspective, a more complex model can be trusted to yield the best future predictions and wins the model selection.
The illustrated behavior of consistent model selection, i.e. to identify and stick to the best representation of the truth, can be found in Schöniger et al. (2015a). In this study on mechanistic models for a laboratory-scale artificial aquifer, several increasingly complex parametrizations of the hydraulic conductivity distribution are ranked. Under growing data size, the consistent selection procedure converges towards the model that represents the true zonated distribution, and it devalues simpler (homogeneous) and more complex (geostatistical) approaches. In contrast, the tendency of non-consistent model selection to prefer increasingly complex models is demonstrated in Vrieze (2012) for regression models.
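The thought experiment can be mimicked numerically with a toy setup. The following is a minimal sketch under assumptions that are not part of the original text: AIC stands in for modeler A's non-consistent criterion and BIC for modeler B's consistent criterion, the "truth" is a noisy sine curve, modeler A's candidates are polynomials of degree 1, 3 and 5 (none of which contains the truth), and modeler B's candidates are sine expansions of which the second one is the truth. All function names, sample sizes and the noise level are hypothetical choices made for illustration only.

```python
import numpy as np


def gaussian_ic(y, X, penalty_per_param):
    """Least-squares fit of y on X, scored as n*ln(RSS/n) + penalty*k.

    k counts regression coefficients only; the noise variance is shared by
    all candidates and would shift every score by the same constant.
    """
    n, k = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ coef) ** 2))
    return n * np.log(rss / n) + penalty_per_param * k


def poly_design(x, degree):
    """Data-driven candidate (modeler A): polynomial of a given degree."""
    return np.vander(x, degree + 1, increasing=True)


def sine_design(x, n_harmonics):
    """'Physics-based' candidate (modeler B): intercept plus sine harmonics."""
    cols = [np.ones_like(x)]
    cols += [np.sin(h * np.pi * x) for h in range(1, n_harmonics + 1)]
    return np.column_stack(cols)


def select(n, rng):
    """Return the (A, B) model choices, numbered 1 to 3, for sample size n."""
    x = np.linspace(-1.0, 1.0, n)
    # Assumed truth: sin(pi*x) plus Gaussian noise with unit standard deviation.
    y = np.sin(np.pi * x) + rng.normal(0.0, 1.0, n)
    designs_a = [poly_design(x, d) for d in (1, 3, 5)]   # truth not included
    designs_b = [sine_design(x, h) for h in (0, 1, 3)]   # model 2 is the truth
    aic = [gaussian_ic(y, X, 2.0) for X in designs_a]        # non-consistent
    bic = [gaussian_ic(y, X, np.log(n)) for X in designs_b]  # consistent
    return int(np.argmin(aic)) + 1, int(np.argmin(bic)) + 1


rng = np.random.default_rng(0)
choices = {n: select(n, rng) for n in (10, 300, 100_000)}
for n, (choice_a, choice_b) in choices.items():
    print(f"n={n:>6}: modeler A (AIC) picks model {choice_a}, "
          f"modeler B (BIC) picks model {choice_b}")
```

For the largest sample size, modeler A's criterion should favour the most complex polynomial, while modeler B's criterion should settle on the second (true) model. The choices at small sample sizes depend on the particular noise realization and need not match Phases 1 and 2 exactly.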