8 Self-Organizing Financial Stability Map (SOFSM)

8.2 Model evaluation framework

Crisis data require evaluation criteria that account for their complex nature. Crises are oftentimes outlier events in three aspects:

i) they differ significantly from tranquil times, ii) they are commonly more costly, and iii) they occur more rarely.

Given these properties, especially the latter two, I show that the evaluation framework in Paper 6 better resembles the decision problem faced by a policymaker. I first briefly review the literature on evaluating early-warning models and then discuss a general framework for deriving a policymaker’s loss function and the Usefulness of a model.

While a separate strand of the literature has focused on the evaluation of early-warning models, the measures used seldom cover the wide spectrum of factors that may concern a policymaker. The seminal study by Kaminsky et al. (1998) used the simple noise-to-signal ratio to set an optimal threshold value.¹⁵ Building on Receiver Operating Characteristics (ROC) curves and the area below them, measures that Sarlin and Marghescu (2011a) applied to early-warning model evaluation, Jordà and Taylor (2011) formulated a Correct Classification Frontier (CCF), with advantages such as providing visual means and summaries of results for all possible thresholds. Yet these measures do not properly account for varying misclassification costs and imbalanced data, and suffer from the fact that some thresholds may be far from policy relevant (e.g., both ends of the CCF). Likewise, while the comprehensive toolbox for evaluating early-warning models by Candelon et al. (2012) contributes significantly to statistical inference for testing the superiority of one early-warning model over another, it lacks an explicit focus on variations in misclassification costs and imbalanced data. A crucial characteristic of measures that attempt to grasp a problem of this complexity is that they explicitly tailor forecasting objectives and validations to the preferences of a decision-maker and the properties of the underlying data.

15 The noise-to-signal ratio is the ratio of the probability of receiving a signal conditional on no crisis occurring to the probability of receiving a signal conditional on a crisis occurring. Demirgüç-Kunt and Detragiache (2000) and El-Shagi et al. (2012) showed that minimizing the noise-to-signal ratio could lead to a relatively high share of missed crisis episodes (i.e., only noise minimization) if crises are rare and the cost of missing a crisis is high. This common corner solution to the optimization problem arises mainly because the marginal rate of substitution between type I and II errors is unrestricted. Lund-Jensen (2012) concludes the same and chooses not to use the measure, while Drehmann et al. (2011) choose to minimize the noise-to-signal ratio subject to at least two thirds of the crises being correctly called. Paper 6 also illustrates such a corner solution.
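The corner solution described in footnote 15 can be illustrated with a minimal sketch. The probabilities, labels and threshold grid below are hypothetical toy values, and `noise_to_signal` and `missed_share` are illustrative helpers rather than code from any of the cited studies.

```python
# Hypothetical toy data: 3 pre-crisis observations (label 1) among 10.
PROBS = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.5, 0.2, 0.1]
LABELS = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]


def noise_to_signal(probs, labels, threshold):
    """P(signal | no crisis) divided by P(signal | crisis)."""
    crises = sum(labels)
    tranquil = len(labels) - crises
    hits = sum(1 for p, y in zip(probs, labels) if p > threshold and y == 1)
    false_alarms = sum(1 for p, y in zip(probs, labels) if p > threshold and y == 0)
    signal_rate = hits / crises
    noise_rate = false_alarms / tranquil
    # If no crisis is ever signalled, treat the ratio as infinitely bad.
    return float("inf") if signal_rate == 0 else noise_rate / signal_rate


def missed_share(probs, labels, threshold):
    """Share of crisis observations that receive no signal."""
    crises = sum(labels)
    missed = sum(1 for p, y in zip(probs, labels) if p <= threshold and y == 1)
    return missed / crises


# Assumed candidate thresholds; pick the one that minimizes the ratio.
GRID = [0.25, 0.55, 0.8]
BEST = min(GRID, key=lambda t: noise_to_signal(PROBS, LABELS, t))
```

On this toy data the ratio is minimized at the highest threshold, where no false alarms occur but two of the three crisis observations are missed.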

The literature on the derivation of a policymaker’s loss function has attempted to deal with these so-called low-probability, high-impact events. Demirgüç-Kunt and Detragiache (2000) introduced the notion of a policymaker’s loss function in a banking crisis context, where the policymaker faces costs for preventive actions and for type I and II errors (i.e., the probability of not receiving a warning conditional on a crisis occurring and of receiving a warning conditional on no crisis occurring, respectively).

Later, adaptations of this type of loss function have been introduced to early-warning models for other types of crises, e.g., debt crises (Fuertes and Kalotychou, 2007), currency crises (Bussière and Fratzscher, 2008), and asset price boom/bust cycles (Alessi and Detken, 2011). While Bussière and Fratzscher (2008) still focused on costs of preventive actions, the later literature has mainly focused on the trade-off between type I and II errors. There are two key motivations for focusing on relative preferences between the errors:

i) the costs of actions and no actions can be incorporated in preferences between type I and II errors, as unrealized benefits can be "rolled up" into error costs (Elkan, 2001; Fawcett, 2006), and

ii) the uncertainty of exact costs associated with preventive actions, false alarms and missing crises.

In addition to a loss function, Alessi and Detken (2011) also propose a Usefulness measure that indicates whether the loss of the prediction is smaller than the loss of disregarding the model. However, while the above evaluation frameworks have become state of the art, they fail to account for characteristics of imbalanced data.¹⁶ In the following, I put forward a loss function and Usefulness measure to better account for the complex nature of crises.

16 While the seminal loss function by Demirgüç-Kunt and Detragiache (2000) accounts for unconditional probabilities, they do not propose a Usefulness measure for the function. Given their complex definition of loss, deriving the Usefulness would not be an entirely straightforward exercise. Further, the version applied in Bussière and Fratzscher (2008) neither accounts for unconditional probabilities nor distinguishes between losses from correct and wrong calls of crisis.

A loss function and Usefulness measure

The occurrence of crisis can be represented with a binary state variable $I_j(0) \in \{0, 1\}$ (where observation $j = 1, 2, \ldots, N$). Predicting the exact timing of distress does not, however, provide enough reaction time for a policymaker. The wide variety of triggers may also complicate the task of identifying exact timings. To enable policy actions for preventing or decreasing further build-up of vulnerabilities and strengthening the financial system, the focus should rather be on identifying pre-crisis periods $I_j(h) \in \{0, 1\}$ with a specified forecast horizon $h$. Let $I_j(h)$ be a binary indicator that equals one during pre-crisis periods and zero otherwise. Using univariate or multivariate data, various methods can be used for turning indicators into estimated probabilities of an impending crisis $p_j \in [0, 1]$ (i.e., probability forecasts). To mimic the ideal leading indicator $I_j(h)$, the probability $p_j$ is transformed into a binary point forecast $P_j$ that equals one if $p_j$ exceeds a specified threshold $\lambda$ and zero otherwise. The correspondence between $P_j$ and $I_j$ can be summarized into a so-called contingency matrix (i.e., frequencies of prediction-realization combinations), as shown in Table 8.2.

Table 8.2: A contingency matrix.

| Predicted class $P_j$ | Actual class $I_j$: Crisis | Actual class $I_j$: No crisis |
|---|---|---|
| Signal | A: True positive (TP) | B: False positive (FP) |
| No signal | C: False negative (FN) | D: True negative (TN) |
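The thresholding step and the cells of Table 8.2 can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function name `contingency_matrix`, the `Contingency` tuple and the toy data are assumptions introduced here.

```python
from collections import namedtuple

# Cells of Table 8.2: A = TP, B = FP, C = FN, D = TN.
Contingency = namedtuple("Contingency", ["A", "B", "C", "D"])


def contingency_matrix(probs, labels, threshold):
    """Count prediction-realization combinations for P_j = 1 if p_j > threshold."""
    a = b = c = d = 0
    for p, y in zip(probs, labels):
        signal = p > threshold  # binary point forecast P_j
        if signal:
            if y == 1:
                a += 1  # signal and crisis: true positive
            else:
                b += 1  # signal but no crisis: false positive
        else:
            if y == 1:
                c += 1  # no signal but crisis: false negative
            else:
                d += 1  # no signal and no crisis: true negative
    return Contingency(a, b, c, d)


# Hypothetical probability forecasts p_j and pre-crisis indicators I_j(h).
PROBS = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.5, 0.2, 0.1]
LABELS = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
CM = contingency_matrix(PROBS, LABELS, threshold=0.55)
```

With the assumed threshold of 0.55, the toy data yield A = 2, B = 1, C = 1 and D = 6.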

From the elements of the above matrix, one can then define various goodness-of-fit measures. I approach the problem from the viewpoint of a policymaker.¹⁷ In a two-class prediction problem, policymakers can be assumed to have relative preferences between two types of errors: issuing false alarms and missing crises.

Type I errors represent the probability of not receiving a warning conditional on a crisis occurring, $T_1 = P(p_j \le \lambda \mid I_j(h) = 1)$, and type II errors the probability of receiving a warning conditional on no crisis occurring, $T_2 = P(p_j > \lambda \mid I_j(h) = 0)$. The loss of a policymaker consists of $T_1$ and $T_2$ weighted according to her relative preferences between missing crises ($\mu \in [0, 1]$) and giving false alarms ($1 - \mu$). However, when only using $T_1$ and $T_2$ weighted according to relative preferences, we fail to account for imbalances in class size.¹⁸ Finally, given the probabilities $p_j$ of a model, the policymaker should aim at choosing a threshold $\lambda$ such that her loss is minimized.

17 A further discussion on shaping decision-makers’ problems through loss functions, as well as on the relation between statistical and economic value of predictions, can be found in Granger and Pesaran (2000) and Abhyankar et al. (2005).

18 The loss function used by Alessi and Detken (2011) differs from the one introduced here as it assumes equal class size. Similarly, their Usefulness measure does not account for imbalanced classes, as the loss of disregarding a model depends solely on the preferences. Usefulness measures close to that in Alessi and Detken (2011) have been applied in a large number of works, such as Lo Duca and Peltonen (2013), Sarlin and Marghescu (2011a), El-Shagi et al. (2012), and Bisias et al. (2012). Similar loss functions have been applied in Fuertes and Kalotychou (2007), Candelon et al. (2012), Lund-Jensen (2012), and Knedlik and von Schweinitz (2012).

The preference parameters may also be derived from a benefit/cost matrix that matches the contingency matrix. A standard 2x2 benefit/cost matrix may easily be manipulated to only include error costs by scaling and shifting entries of columns without affecting the decisions (Elkan, 2001; Fawcett, 2006). A benefit may be treated as a negative error cost and hence unrealized benefits can be "rolled up" into error costs. For instance, the costs $c$ for the elements of the matrix, with two degrees of freedom, can be reduced to a simpler matrix of class-specific costs $c_1$ and $c_2$ with one degree of freedom: $c_1 = c_C - c_A$ and $c_2 = c_B - c_D$ (the subscripts refer to Table 8.2). Most likely, $c_B$ and $c_C$ are non-negative costs, while $c_A$ and $c_D$ are non-positive. From this, we can derive the relative preferences $\mu = c_1/(c_1 + c_2)$ and $1 - \mu = c_2/(c_1 + c_2)$.
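The reduction from a 2x2 benefit/cost matrix to a single preference parameter can be sketched as follows. The function name `preference_from_costs` and the cost values are hypothetical and chosen only for illustration.

```python
def preference_from_costs(c_a, c_b, c_c, c_d):
    """Derive mu from the costs of cells A (TP), B (FP), C (FN) and D (TN)."""
    c1 = c_c - c_a  # class-specific cost of missing a crisis
    c2 = c_b - c_d  # class-specific cost of a false alarm
    if c1 < 0 or c2 < 0 or c1 + c2 == 0:
        raise ValueError("error costs must be non-negative and not both zero")
    return c1 / (c1 + c2)


# Hypothetical costs: a correct crisis call carries a benefit (negative cost).
MU = preference_from_costs(c_a=-1.0, c_b=1.0, c_c=7.0, c_d=0.0)

# Shifting both entries of the crisis column (A and C) by a constant,
# or scaling every entry by a common factor, leaves mu unchanged.
MU_SHIFTED = preference_from_costs(c_a=4.0, c_b=1.0, c_c=12.0, c_d=0.0)
MU_SCALED = preference_from_costs(c_a=-2.0, c_b=2.0, c_c=14.0, c_d=0.0)
```

In this example $c_1 = 8$ and $c_2 = 1$, so the policymaker's preference for avoiding missed crises is $\mu = 8/9$.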

By accounting for the unconditional probabilities of crises $P_1 = P(I_j(h) = 1)$ and tranquil periods $P_2 = P(I_j(h) = 0) = 1 - P_1$, the loss function is as follows:

$$L(\mu) = \mu T_1 P_1 + (1 - \mu) T_2 P_2 \tag{8.3}$$

As the parameters are unknown ex ante, we can use in-sample frequencies to estimate them. Given a threshold $\lambda$ and forecast horizon $h$, $P_1$ and $P_2$ are estimated with the frequencies of the classes ($P_1 = (A + C)/(A + B + C + D)$ and $P_2 = (B + D)/(A + B + C + D)$) and $T_1$ and $T_2$ with the error rates ($T_1 = C/(A + C)$ and $T_2 = B/(B + D)$). Using the loss function $L(\mu)$, we can then define the Usefulness of a model. A policymaker could achieve a loss of $\min(P_1, P_2)$ by always issuing a signal of a crisis if $P_1 > 0.5$ or never issuing a signal if $P_2 > 0.5$. However, by weighting with the policymaker’s preferences, as she may be more concerned about one of the classes, we achieve the loss $\min(\mu P_1, (1 - \mu) P_2)$ when ignoring the model.
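As a worked step, substituting the in-sample estimates into Equation 8.3 shows that the class frequencies cancel against the denominators of the error rates, so that the estimated loss depends only on the two error counts. Writing $N = A + B + C + D$:

$$
L(\mu) = \mu \, \frac{C}{A + C} \cdot \frac{A + C}{N} + (1 - \mu) \, \frac{B}{B + D} \cdot \frac{B + D}{N} = \frac{\mu C + (1 - \mu) B}{N}
$$

For instance, with the hypothetical counts $A = 2$, $B = 1$, $C = 1$, $D = 6$ and an assumed preference of $\mu = 0.8$, this gives $L(0.8) = (0.8 \cdot 1 + 0.2 \cdot 1)/10 = 0.1$, compared with a loss of $\min(0.8 \cdot 0.3, \, 0.2 \cdot 0.7) = 0.14$ when ignoring the model.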

First, we derive the absolute Usefulness $U_a(\mu)$ of a model by subtracting the loss generated by the model from the loss of ignoring it:

$$U_a(\mu) = \min(\mu P_1, (1 - \mu) P_2) - L(\mu) \tag{8.4}$$

This measure highlights the fact that achieving well-performing, useful models on highly imbalanced data is a difficult task. Hence, the very attempt to build an early-warning model with imbalanced data implicitly requires the policymaker to be more concerned about the rare class. With a non-perfectly performing model, it would otherwise easily pay off for the policymaker to always signal the high-frequency class. Second, we compute the share of $U_a(\mu)$ in the maximum possible Usefulness of the model with a measure that is coined relative Usefulness:

$$U_r(\mu) = \frac{U_a(\mu)}{\min(\mu P_1, (1 - \mu) P_2)} \tag{8.5}$$

That is, $U_r(\mu)$ reports $U_a(\mu)$ as a percentage of the Usefulness that a policymaker would gain with a perfectly performing model. This derives from the fact that if $L(\mu) = 0$ then $U_a(\mu) = \min(\mu P_1, (1 - \mu) P_2)$. $U_r(\mu)$ provides a means of representing the Usefulness as a ratio rather than only reporting a number that is difficult to judge. In particular, it facilitates comparisons of models for policymakers with different preferences. Within the above framework, we can derive $U_a(\mu)$ and $U_r(\mu)$ for policymakers of different kinds depending on their preferences, which is essentially a parameter to be specified ad hoc.
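Equations 8.3 to 8.5, together with the choice of the loss-minimizing threshold, can be sketched as below. This is a minimal illustration under stated assumptions: the function names, the toy data, the preference $\mu = 0.8$ and the threshold grid are all hypothetical, and a simple grid search stands in for whatever optimization is used in practice.

```python
def class_counts(probs, labels, threshold):
    """Return (A, B, C, D) of Table 8.2 for signals p > threshold."""
    a = sum(1 for p, y in zip(probs, labels) if p > threshold and y == 1)
    b = sum(1 for p, y in zip(probs, labels) if p > threshold and y == 0)
    c = sum(1 for p, y in zip(probs, labels) if p <= threshold and y == 1)
    d = sum(1 for p, y in zip(probs, labels) if p <= threshold and y == 0)
    return a, b, c, d


def loss(mu, a, b, c, d):
    """Equation 8.3 with in-sample estimates of P1, P2, T1 and T2."""
    n = a + b + c + d
    p1, p2 = (a + c) / n, (b + d) / n
    t1 = c / (a + c) if a + c else 0.0  # share of missed crises
    t2 = b / (b + d) if b + d else 0.0  # share of false alarms
    return mu * t1 * p1 + (1 - mu) * t2 * p2


def usefulness(mu, a, b, c, d):
    """Return (U_a, U_r) from Equations 8.4 and 8.5."""
    n = a + b + c + d
    p1, p2 = (a + c) / n, (b + d) / n
    benchmark = min(mu * p1, (1 - mu) * p2)  # loss when ignoring the model
    u_a = benchmark - loss(mu, a, b, c, d)
    u_r = u_a / benchmark if benchmark > 0 else float("nan")
    return u_a, u_r


def best_threshold(probs, labels, mu, grid):
    """Threshold on the grid with the smallest loss (first one on ties)."""
    return min(grid, key=lambda t: loss(mu, *class_counts(probs, labels, t)))


# Hypothetical probability forecasts, pre-crisis indicators and preferences.
PROBS = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.5, 0.2, 0.1]
LABELS = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
MU = 0.8
GRID = [0.15, 0.25, 0.35, 0.45, 0.55, 0.8]

LAMBDA = best_threshold(PROBS, LABELS, MU, GRID)
U_A, U_R = usefulness(MU, *class_counts(PROBS, LABELS, LAMBDA))
```

With these toy values the grid search selects a threshold of 0.25, at which the model captures a little over half of the Usefulness of a perfectly performing model. At the threshold of 0.8 the absolute Usefulness turns negative, meaning the policymaker would be better off ignoring the model.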

This yields a cost matrix with costs $\mu$ for type I errors and $1 - \mu$ for type II errors. While constants could be added to these entries and their scaling may be modified, this approach favors simplicity. Hence, the rationale for preferring this framework is that it enables setting relative preferences between the errors. Setting specific costs for each entry of the cost matrix is a difficult task in a real-world setting, not only because the problem with two degrees of freedom may be difficult to untangle, but also because exact values of the cost matrix entries are most often unknown.

In addition to the above framework, the use of pooled panel data motivates including observation-specific costs in the loss function, as the importance of a single country in the evaluation phase may vary depending on the objectives of the policymaker. In an evaluation framework, this leads to a need for weighting entities in terms of their importance, such as systemic relevance or size. The entity-level importance is, however, also a time-varying parameter, and should thus preferably be defined on the observation level. Although a policymaker’s loss function and Usefulness measure that depend on observation-varying costs are shown in Paper 6, this thesis focuses only on class-specific costs (that is, it does not discriminate between the importance of countries).

Other goodness-of-fit measures

The literature has provided and applied a wide range of goodness-of-fit measures. A large number of them can be defined from the elements of the contingency matrix in Table 8.2. Thus, the following goodness-of-fit measures are used to support the evaluation of models: recall and precision rates, False Positive (FP), True Positive (TP), False Negative (FN) and True Negative (TN) rates, and overall accuracy.¹⁹ In addition, the global performance of models can be measured using ROC curves and the area under the ROC curve (AUC). The ROC curve shows the trade-off between the benefits and costs of choosing a certain threshold. When two models are compared, the better model has a higher benefit (expressed in terms of TP rate on the vertical axis) at the same cost (expressed in terms of FP rate on the horizontal axis). In general, the ROC curve plots, for the whole range of measures, the conditional probability of positives to the conditional probability of negatives:

$$\mathrm{ROC} = P(P = 1 \mid C = 1) \, / \, \bigl(1 - P(P = 0 \mid C = 0)\bigr)$$

In this sense, as each FP rate can be associated with a threshold for classifying crisis and tranquil events, the measure shows performance over all thresholds. The size of the AUC is estimated using trapezoidal approximations. It measures the probability that a randomly chosen crisis observation is ranked higher than a randomly chosen tranquil one. A random ranking has an expected AUC of 0.5, while a perfect ranking has an AUC equal to 1.
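The trapezoidal estimate of the AUC and its interpretation as a ranking probability can be compared in a short sketch. The function names and the toy data are assumptions introduced here, and both classes are assumed to be present in the sample.

```python
def roc_points(probs, labels):
    """(FP rate, TP rate) pairs from (0, 0) to (1, 1), one per distinct score."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    # Signalling when p >= t at each observed score t corresponds to placing
    # a strict threshold just below t.
    for t in sorted(set(probs), reverse=True):
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def auc_rank(probs, labels):
    """Share of (crisis, tranquil) pairs ranked correctly; ties count one half."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical probability forecasts and pre-crisis indicators.
PROBS = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.5, 0.2, 0.1]
LABELS = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
AUC_TRAPEZOID = auc_trapezoid(roc_points(PROBS, LABELS))
AUC_RANK = auc_rank(PROBS, LABELS)
```

On the toy data both computations give 17/21, since 17 of the 21 crisis-tranquil pairs are ranked correctly.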

In document Mapping financial stability (Page 184-188)