2.3 Methods
2.3.2 Regression and model trees
Regression trees (RTs) are nonparametric, nonlinear methods. Like other regression methods, they can be applied to analyze the underlying dataset and to predict the (numeric) dependent variable. An essential difference between RTs and parametric methods, such as linear or logistic regression, is that no ex-ante assumption is made concerning the distribution of the underlying data, and no functional relationship is specified.
These characteristics are particularly beneficial for LGD estimation because it is typically not possible to describe the LGD suitably with a single distribution, such as the normal distribution. In addition, the distribution of the LGD varies significantly with the underlying data. For example, as described in Section 2.2, the LGD distributions of the three companies studied here are all multimodal, although there are appreciable differences between companies, such as the number of maxima. Further types of distributions are observed for bank loans (for an overview, see Dermine and Neto de Carvalho (2005)).
The basic idea of regression and model trees is to partition the entire dataset into homogeneous subsets by a sequence of splits, which creates a tree of logical if-then conditions. The root node of the tree contains all instances of the underlying data, whereas each leaf covers only a fraction of them.
In an RT, the prediction of the dependent variable is a constant for all instances belonging to a leaf, typically the average value of these instances. Model trees extend RTs in that the target variable of instances belonging to a leaf is estimated by a linear regression model. Model trees are therefore hybrid estimation methods that combine RTs and linear regression. Both components have a track record in LGD estimation, which makes model trees a natural candidate: RTs have been used successfully in previous studies such as Bastos (2010) and Qi and Zhao (2011). Linear regression models are also applied to analyze and predict LGDs,
and these models may deliver comparable or better results than those of more complex models, as shown by Bellotti and Crook (2012) and Zhang and Thomas (2012).
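The difference between the two leaf predictions can be illustrated with a minimal sketch. The leaf data, the single regressor, and the function names are hypothetical and chosen only for illustration; an actual model tree fits a multivariate linear model in each leaf.

```python
from statistics import mean

def rt_leaf_prediction(y_leaf):
    """Regression tree leaf: one constant, the mean of the leaf's targets."""
    return mean(y_leaf)

def fit_simple_linear(x_leaf, y_leaf):
    """Model tree leaf, simplified to one regressor: OLS slope and intercept."""
    x_bar, y_bar = mean(x_leaf), mean(y_leaf)
    sxx = sum((x - x_bar) ** 2 for x in x_leaf)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_leaf, y_leaf))
    slope = sxy / sxx if sxx > 0 else 0.0  # fall back to a constant if x does not vary
    return slope, y_bar - slope * x_bar

# Hypothetical leaf: x is some explanatory variable, y the realized LGD.
x_leaf = [0.2, 0.4, 0.6, 0.8]
y_leaf = [0.10, 0.20, 0.30, 0.40]

constant = rt_leaf_prediction(y_leaf)            # same value for every instance
slope, intercept = fit_simple_linear(x_leaf, y_leaf)
linear_at_08 = intercept + slope * 0.8           # varies with x inside the leaf
```

In this toy leaf the RT assigns the leaf mean to every instance, whereas the linear leaf model follows the trend within the leaf.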
For our LGD estimation, we apply the M5' model tree algorithm and the corresponding RT algorithm introduced by Wang and Witten (1997) and described by Witten et al. (2011). This algorithm is a reconstruction of Quinlan’s M5 algorithm, which was published in 1992. In the M5' algorithm, the underlying dataset is divided step by step, each time using the binary split on an explanatory variable that yields the greatest expected reduction in the standard deviation of the dependent variable. The constructed tree is subsequently pruned back to an appropriate size to control overfitting, which can impair out-of-sample performance.
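The split criterion can be sketched as follows for a single numeric explanatory variable. This is a simplified illustration of the standard deviation reduction idea, not the full M5' procedure (no pruning, no handling of nominal variables); the function names and the data are hypothetical.

```python
from statistics import pstdev

def sdr(y_parent, y_left, y_right):
    """Standard deviation reduction: sd(parent) minus the size-weighted sd of the children."""
    n = len(y_parent)
    weighted = (len(y_left) / n) * pstdev(y_left) + (len(y_right) / n) * pstdev(y_right)
    return pstdev(y_parent) - weighted

def best_binary_split(x, y):
    """Return (threshold, sdr) of the best split 'x <= threshold', or None if x is constant."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        gain = sdr(y, left, right)
        if best is None or gain > best[1]:
            best = (threshold, gain)
    return best

# Hypothetical data with two clearly separated LGD levels.
x = [1, 2, 3, 4]
y = [0.1, 0.1, 0.9, 0.9]
threshold, gain = best_binary_split(x, y)
```

A tree builder would evaluate this search for every explanatory variable, apply the split with the largest reduction, and recurse on the two resulting subsets.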
The resulting tree depends essentially on the explanatory variables used, particularly for the M5' model tree algorithm. Selecting appropriate variables is a complex issue because their relevance and effectiveness are not always known ex ante. In the first step, we consider a large number of potential explanatory variables. However, it might be preferable to include only a fraction of the available variables, which we account for in the second step.
There are various algorithmic approaches to variable selection; two frequently applied greedy algorithms are forward selection and backward elimination. Bellotti and Crook (2012) use forward selection for their LGD estimations of retail credit cards with OLS regression. However, forward selection has the significant disadvantage of neglecting variable interactions.
Instead of forward selection, we employ backward elimination: starting with all available variables, we eliminate in each step the variable whose removal yields the best value of the respective fit criterion. This procedure continues until a stop condition is reached or all variables are eliminated.
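The greedy procedure can be sketched generically. The sketch assumes a fit criterion for which lower values are better; `criterion`, `min_improvement`, and the toy scoring function are hypothetical placeholders and do not represent the implementation used in this study.

```python
def backward_elimination(variables, criterion, min_improvement=1.0):
    """Greedy backward elimination for a criterion where lower is better.

    criterion(subset) must return a number for any list of variables,
    including the empty list.
    """
    selected = list(variables)
    current = criterion(selected)
    while selected:
        # Score every candidate model that drops exactly one variable.
        candidates = [
            (criterion([v for v in selected if v != drop]), drop)
            for drop in selected
        ]
        best_score, best_drop = min(candidates)
        # Stop condition: the best removal must improve the criterion enough.
        if current - best_score < min_improvement:
            break
        selected.remove(best_drop)
        current = best_score
    return selected, current

# Toy criterion (purely illustrative): each variable costs 1,
# and each missing relevant variable costs 10.
RELEVANT = {"a", "c"}

def toy_criterion(subset):
    return 10 * len(RELEVANT - set(subset)) + len(subset)

selected, score = backward_elimination(["a", "b", "c", "d"], toy_criterion)
```

In this toy run the two irrelevant variables are removed one after the other, and the procedure stops once any further removal would worsen the criterion.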
A typical fit criterion for regression models is the F-score. For forecasting, however, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are commonly used, both of which are based on the log-likelihood function.
Analogous to the approximation of the AIC used by Bellotti and Crook (2012), the BIC can be approximated by
$$\mathrm{BIC} = n \cdot \ln(\mathrm{MSE}) + p \cdot \ln(n), \tag{2.1}$$
where n denotes the number of observations, p is the number of input variables, and MSE is the mean squared error of the observations.
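As a quick numerical illustration, the approximation in (2.1) can be evaluated directly. The function name and the input values are hypothetical.

```python
import math

def bic_approx(mse, n, p):
    """Approximate BIC as in (2.1): n * ln(MSE) + p * ln(n)."""
    return n * math.log(mse) + p * math.log(n)

# Hypothetical model: 1,000 observations, 10 input variables, MSE of 0.05.
bic_full = bic_approx(mse=0.05, n=1000, p=10)

# Dropping two variables lowers the penalty by 2 * ln(n); the reduced model is
# preferred as long as the MSE does not rise enough to offset that.
bic_reduced = bic_approx(mse=0.0502, n=1000, p=8)
```

Because the MSE lies below one, the first term is negative, and each additional input variable adds ln(n) to the criterion.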
We use the BIC, which penalizes the complexity of the model more than the AIC. This complexity is measured by the number of input variables. In addition to the number of explanatory variables, regression and model trees offer another complexity feature: the number of leaves in the computed tree. This aspect is among those included by Gray and Fan (2008) when designing the TARGET RT algorithm. The more leaves the computed tree has, the greater the risk that a contract will be misclassified, which negatively influences the estimation.
We find that the number of leaves is determined not only by the pruning procedure but also by the input variables. Thus, we modify the BIC and additionally penalize the size of the computed tree:
$$\mathrm{BIC}^* = n \cdot \ln(\mathrm{MSE}) + p \cdot \ln(n) + |T| \cdot \ln(n), \tag{2.2}$$
where |T| denotes the number of leaves of the computed tree.
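The effect of the additional tree-size penalty can be seen in a small sketch that compares two candidate trees. All numbers and names are hypothetical and serve only to illustrate (2.2).

```python
import math

def bic_star(mse, n, p, n_leaves):
    """Modified BIC as in (2.2): fit term plus penalties for variables and leaves."""
    fit = n * math.log(mse)
    variable_penalty = p * math.log(n)
    tree_penalty = n_leaves * math.log(n)
    return fit + variable_penalty + tree_penalty

n = 5000
# Hypothetical candidates with the same inputs: the larger tree fits slightly
# better in terms of MSE but needs 14 more leaves.
small_tree = bic_star(mse=0.080, n=n, p=10, n_leaves=6)
large_tree = bic_star(mse=0.079, n=n, p=10, n_leaves=20)

# Without the leaf penalty, the ranking of the two candidates flips.
small_plain = small_tree - 6 * math.log(n)
large_plain = large_tree - 20 * math.log(n)
```

With these values the smaller tree attains the lower modified criterion, although the unmodified criterion would favor the larger tree.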
For $\mathrm{BIC}^*$, lower values are preferred. For our data, $\mathrm{MSE} \in (0, 1)$, $n \gg p$, and $n \gg |T|$ hold; thus, we have $\mathrm{BIC}^* < 0$. We set the stop condition for our backward elimination such that a variable can only be eliminated in the $i$-th iteration if the $\mathrm{BIC}^*$ value improves, that is, decreases, by at least one in absolute terms. The following constraint must therefore be fulfilled:
$$\mathrm{BIC}^*_{i-1} - \mathrm{BIC}^*_{i} \geq 1. \tag{2.3}$$
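The stop condition in (2.3) amounts to a simple check between consecutive iterations. The following sketch applies it to a hypothetical sequence of criterion values, where entry i is the best value attainable in iteration i; the names and numbers are assumptions for illustration only.

```python
def may_eliminate(bic_prev, bic_curr, threshold=1.0):
    """Constraint (2.3): the criterion must drop by at least `threshold`."""
    return bic_prev - bic_curr >= threshold

def accepted_eliminations(trace, threshold=1.0):
    """Count how many consecutive eliminations satisfy (2.3) before the stop."""
    accepted = 0
    for bic_prev, bic_curr in zip(trace, trace[1:]):
        if not may_eliminate(bic_prev, bic_curr, threshold):
            break
        accepted += 1
    return accepted

# Hypothetical criterion values for iterations 0, 1, 2, 3 (all negative).
trace = [-12449.8, -12452.1, -12455.0, -12455.4]
n_accepted = accepted_eliminations(trace)
```

Here the first two eliminations lower the criterion by more than one and are accepted, while the third improves it by only about 0.4, so the elimination stops.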