Model Fit and Model Complexity - Bayesian Multi-Model Frameworks

3.1.1 Overfitting and Underfitting

In statistical terms, conceptual uncertainty in modelling manifests itself as so-called overfitting or underfitting (e.g. Burnham and Anderson, 2002). An overfitted model closely matches the available (within-sample) data D, but struggles to reliably predict (out-of-sample) data D0. Such models are usually too flexible and tend to fit patterns in D that do not truly exist (e.g., that are merely caused by noise), which deteriorates the prediction of D0. An underfitted model roughly follows the trend of D, but fails to capture the actual pattern around the plain trend and therefore also struggles to predict D0. Instead of being too flexible, an underfitted model is not flexible enough. As a rule of thumb, underfitting implies large bias, i.e., systematic error between model predictions and data, but typically also low variance of the predictions; overfitting implies the reverse. Bias and variance are visualized in Figure 5.
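The rule of thumb can be made explicit with the standard bias-variance decomposition of the expected squared prediction error. The notation is an assumption made for this sketch and does not appear in the text: data are taken to follow y = f(x) + ε with zero-mean noise of variance σ², and \hat f_D denotes the model fitted to a randomly drawn data set D.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\bigl(y - \hat f_D(x)\bigr)^2\right]
  = \underbrace{\bigl(\mathbb{E}_D[\hat f_D(x)] - f(x)\bigr)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\bigl(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)]\bigr)^2\right]}_{\text{Variance}}
  + \sigma^2
```

Under these assumptions, the first two terms depend on the model, while σ² is an irreducible noise floor that no choice of model complexity can remove.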

An illustration of both overfitting and underfitting is given by the simple binary classification problem in Figure 6, generated using Python’s scikit-learn package (Pedregosa et al., 2011). There, three different models classify 250 points (with two unspecified attributes on the axes) by either a red or a blue label in such a way that further points will also be classified correctly. We assume that the points are correctly separated by a smooth S-shaped curve, with some noise that explains switched labels in the fringe zone. The model on the left separates the two classes by a straight line, which clearly underfits the pattern of the data. The middle one appears to be a reliable estimate of the underlying classification model. The model on the right overfits: it adapts to the noise as well as to the pattern of the data.

[Figure 5. Panel (a): targets labelled Accurate (Low Bias) / Inaccurate (High Bias) and Precise (Low Var.) / Imprecise (High Var.). Panel (b): Error versus Complexity, showing Bias², Variance, their sum Bias² + Variance, and the Optimum.]

Figure 5: Concepts and effects of bias and variance: (a) Accuracy and precision visualized as bias and variance of shots on a target. Bias is the distance between the target center and the average position of shots. Variance is the spread of shots around their average. (b) Decomposition of total squared error into squared bias and variance (after e.g. Friedman et al. (2001)). Bias is expected to decrease and variance to grow with increasing model complexity, both due to growing model flexibility. Their superposition forms a minimum that marks optimal model complexity (from Höge et al., 2018).

Figure 6: Illustrated underfitting (left), proper fitting (center) and overfitting (right) in binary classification of data to a red or blue class with unspecified attributes on the axes.
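A comparable experiment can be sketched with NumPy alone. The three models behind Figure 6 are not specified in the text, so this sketch swaps in a k-nearest-neighbour classifier whose flexibility is set by k. The sine-shaped boundary, the noise level, the values of k and all function names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Draw n points; the assumed true class boundary is the S-shaped curve y = sin(x)."""
    X = rng.uniform(-3.0, 3.0, size=(n, 2))
    # Noise on the decision function switches labels mainly in the fringe zone.
    margin = X[:, 1] - np.sin(X[:, 0]) + rng.normal(0.0, 0.4, size=n)
    return X, (margin > 0).astype(int)

def knn_predict(X_train, y_train, X_query, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = y_train[nearest].mean(axis=1)
    return (votes > 0.5).astype(int)

X_train, y_train = make_data(250)
X_test, y_test = make_data(250)

accuracy = {}
for k in (1, 15, 249):  # odd k avoids tied votes
    train_acc = float((knn_predict(X_train, y_train, X_train, k) == y_train).mean())
    test_acc = float((knn_predict(X_train, y_train, X_test, k) == y_test).mean())
    accuracy[k] = (train_acc, test_acc)
    print(f"k={k:3d}  train accuracy={train_acc:.2f}  test accuracy={test_acc:.2f}")
```

With k = 1 every training point is its own nearest neighbour, so the training data are reproduced exactly, including the switched labels. With k = 249 the vote is nearly a global majority and the boundary is ignored. An intermediate k sits between these two extremes.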

In regression, over- and underfitting can be illustrated by the following standard example: given 10 data points at distinct locations, a 9th-order polynomial yields a perfect fit with zero residuals. Every lower-order polynomial underfits and provides a worse fit: the lower the order, the worse the fit. Every higher-order polynomial also fits the available data perfectly, but it overfits as well: between the 10 points and beyond the within-sample data range, the “wiggling” of the polynomial becomes stronger the higher its order. Note that if the data are prone to error, the perfect fit of the 9th-order polynomial is actually a misfit that can already be interpreted as overfitting.
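The polynomial example can be reproduced in a few lines. This is a minimal sketch: the sine-shaped truth, the noise level and the random seed are assumptions chosen for illustration, and NumPy's `Polynomial.fit` is used for the least-squares fit.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def truth(x):
    """Assumed noise-free signal (not taken from the text)."""
    return np.sin(2.0 * np.pi * x)

# 10 noisy within-sample points and a dense grid between them.
x_train = np.linspace(0.0, 1.0, 10)
y_train = truth(x_train) + rng.normal(0.0, 0.2, size=x_train.size)
x_dense = np.linspace(0.0, 1.0, 200)

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

train_rmse, dense_rmse = {}, {}
for degree in (1, 3, 9):
    model = Polynomial.fit(x_train, y_train, degree)  # least-squares fit
    train_rmse[degree] = rmse(model(x_train), y_train)        # misfit to the data
    dense_rmse[degree] = rmse(model(x_dense), truth(x_dense))  # error against the truth
    print(f"order {degree}: within-sample RMSE={train_rmse[degree]:.2e}, "
          f"RMSE against truth={dense_rmse[degree]:.2e}")
```

The within-sample error can only shrink as the order grows and vanishes (up to round-off) at order 9. The error against the noise-free signal does not vanish, because the interpolating polynomial passes through the noise as well.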

Concern about overfitting or underfitting is typically implied when modellers refer to model complexity (see Figure 5). The excessive flexibility of overfitted models is usually attributed to too many parameters or functional terms, highly non-linear relationships, etc. Such models overestimate the complexity of the DGP and therefore fail to explain or predict it. Overfitting is encountered more frequently than underfitting. It becomes apparent as, e.g., non-uniqueness of calibration or poor parameter identifiability (e.g. Schoups et al., 2008). Underfitting refers to the other extreme, where models underestimate the system complexity and are too simple to fully resolve the patterns of the DGP hidden in the data, i.e., to decipher the full system complexity.

3.1.2 Model Complexity Control

Reliable and successful modelling requires model complexity control (Schoups et al., 2008). I suggest distinguishing within-model and between-model complexity control:

• Within-model complexity control for a single model means limiting its flexibility.

• Between-model complexity control between multiple models (of typically deviating complexity) refers either to finding the one model that suffers least from overfitting or underfitting, or to employing models of different complexities together so that they mutually compensate for their individual shortcomings.

Within a model, complexity control is achieved by so-called regularization. This technique is applied during model calibration or conditioning, primarily in ill-posed problems (e.g. Tarantola, 2005). Regularization means providing the model with further information in addition to the calibration data. Effectively, this additional information constrains the model output and therefore counteracts overfitting by reducing model flexibility, or underfitting by preventing the extraction of false trends.
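How such additional information limits flexibility can be sketched with an L2 penalty on the coefficients of a 9th-order polynomial. The data, the penalty weights `lam` and the helper `ridge_fit` are hypothetical choices made for this illustration, not the author's setup.

```python
import numpy as np

rng = np.random.default_rng(2)

# 10 noisy points and a 9th-order polynomial basis (columns 1, x, ..., x^9).
x = np.linspace(-1.0, 1.0, 10)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, size=x.size)
X = np.vander(x, N=10, increasing=True)

def ridge_fit(X, y, lam):
    """Minimise ||X w - y||^2 + lam * ||w||^2 via an augmented least-squares system.

    For simplicity, all coefficients (including the constant term) are penalised.
    """
    n_params = X.shape[1]
    A = np.vstack([X, np.sqrt(lam) * np.eye(n_params)])
    b = np.concatenate([y, np.zeros(n_params)])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

results = []
for lam in (0.0, 1e-3, 1e-1, 10.0):
    w = ridge_fit(X, y, lam)
    coef_norm = float(np.linalg.norm(w))
    misfit = float(np.linalg.norm(X @ w - y))
    results.append((lam, coef_norm, misfit))
    print(f"lam={lam:g}: ||w||={coef_norm:.3g}, data misfit={misfit:.3g}")
```

Without a penalty the polynomial interpolates the noisy data. As the penalty weight grows, the coefficient norm shrinks and the data misfit increases, which is the trade-off between flexibility and fit that regularization controls.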

Typically, this additional information concerns the parameters and allows them to be constrained during calibration, e.g., by preventing extreme parameter values. Common examples of regularization are the so-called LASSO and Tikhonov regularization (Marconato et al., 2013; Bardsley et al., 2015; Vaiter et al., 2015), which respectively penalize the L1- or L2-norm of the parameter values. When models are operated in a probabilistic (Bayesian) framework, they are assigned prior parameter distributions, which automatically act as such additional information. Therefore, applying a Bayesian prior is nothing other than imposing a regularization on the model parameters (e.g. MacKay, 1992; VanderPlas, 2014). The commonly used L1- and L2-norms correspond directly to a Bayesian Laplace and Gaussian prior, respectively.
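The correspondence between norms and priors follows from the maximum a posteriori (MAP) estimate. The symbols are assumptions introduced for this sketch: θ denotes the parameter vector, and τ and b are the scale parameters of independent, zero-mean Gaussian and Laplace priors, respectively.

```latex
\hat\theta_{\mathrm{MAP}}
  = \arg\max_{\theta}\; p(\theta \mid D)
  = \arg\min_{\theta}\; \bigl[ -\ln p(D \mid \theta) - \ln p(\theta) \bigr]

\text{Gaussian prior: } p(\theta) \propto \exp\!\Bigl(-\tfrac{\lVert\theta\rVert_2^2}{2\tau^2}\Bigr)
  \;\Rightarrow\; -\ln p(\theta) = \tfrac{1}{2\tau^2}\,\lVert\theta\rVert_2^2 + \text{const}
  \quad (\text{L2 / Tikhonov penalty})

\text{Laplace prior: } p(\theta) \propto \exp\!\Bigl(-\tfrac{\lVert\theta\rVert_1}{b}\Bigr)
  \;\Rightarrow\; -\ln p(\theta) = \tfrac{1}{b}\,\lVert\theta\rVert_1 + \text{const}
  \quad (\text{L1 / LASSO penalty})
```

Under these assumptions, a narrower prior (smaller τ or b) corresponds to a larger penalty weight and thus to stronger regularization.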

Depending on the model type, models sometimes contain natural constraints, e.g., enforced physical principles such as conservation laws. This additional information prevents such models from fitting “non-sense” patterns in the data. For example, in hydrosystem models, mass balance prevents fitted discharges or concentrations from taking physically impossible negative values. This can be considered a kind of model-type-specific regularization in the sense of additional information.

Between models, complexity control is achieved by model rating and subsequent selection or combination (via averaging, as discussed in Section 2.3). To account for structural deficiencies that lead to overfitting or underfitting of single models, competing models with the same target QoI but different complexity should deliberately be set up and tested. Between-model complexity control then means rating these competitors with a certain representation of model complexity included (law of parsimony). Based on the rating scores that the models achieve, the single model or model combination is identified whose complexity is most appropriate for the modelling task at hand.
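Rating competing models with an explicit complexity term can be sketched as follows. The Bayesian information criterion (BIC) is used here only as one familiar example of such a rating and is not necessarily the criterion intended in the text. The cubic truth, the noise level and the candidate orders are likewise assumptions.

```python
import math
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(3)

# Synthetic data from an assumed cubic truth with Gaussian noise.
x = np.linspace(-2.0, 2.0, 30)
y = x**3 - x + rng.normal(0.0, 0.2, size=x.size)

def bic(rss, n_obs, n_params):
    """BIC for Gaussian errors, up to an additive constant: fit term + complexity penalty."""
    return n_obs * math.log(rss / n_obs) + n_params * math.log(n_obs)

scores = {}
for degree in range(0, 7):  # competing models of increasing complexity
    model = Polynomial.fit(x, y, degree)
    rss = float(np.sum((model(x) - y) ** 2))
    scores[degree] = bic(rss, x.size, degree + 1)
    print(f"order {degree}: BIC={scores[degree]:.1f}")

best_degree = min(scores, key=scores.get)  # lowest score wins
print("selected order:", best_degree)
```

Orders below three are rejected because of their large misfit, while each additional coefficient beyond that must improve the fit by more than its penalty to be preferred. Averaging instead of selecting would turn the same scores into model weights.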

Modellers typically refer to a rather vague notion of model complexity in the context of underfitting and overfitting. Höge et al. (2018) systematically analysed and discussed the model selection criteria from Section 2.5 with respect to their specific takes on model complexity. There, the explicit representation of model complexity within each class B1, B0, A1 or A0 conveys a distinct meaning, whose decisive role will be highlighted in the following. As in Höge et al. (2018), I will discuss it for the two extremes of the M-closed and the M-open setting, in the following also referred to as finite and infinite dimensional truth, respectively. With the exception of Sections 3.2.2 and 3.4.4 and the corresponding class-specific model complexity evaluations in Appendix B, the remainder of Chapter 3 has been published in Höge et al. (2018), and I reuse parts of its text, figures and tables. In consideration of my co-authors, “I” is replaced by “we”.

3.2 The Role of Model Complexity within Model Selection
