
3.3 Empirical analysis

3.3.1 Data description and model specification

To construct the dataset for our empirical analysis, we merged three datasets, all of which cover North American public firms. First, we collect monthly Standard & Poor’s (S&P) rating and default data from Compustat. This dataset contains three types of ratings: long-term issuer credit ratings, short-term issuer credit ratings and subordinated debt ratings, with most observations being of the first type. We define default in our study to be a default rating (D or SD) by S&P in any of the three rating types.25 Consequently, a firm is defined not to be in default in a given period if it does not have a default rating of any type and has a non-default rating of at least one type. We then merge the default histories with quarterly balance sheet data from Compustat and monthly stock market data from Compustat and the Center for Research in Security Prices (CRSP). The balance sheet variables are held constant over the months between financial statements so that the final dataset has monthly time intervals. Since there are on average two months between the end of a fiscal period and the corresponding reporting date, we lag the balance sheet variables by two months so that the values should indeed have been available in each month. Further, following common practice, we exclude financial firms (Standard Industrial Classification (SIC) codes 6000-6799) since these are assumed to be structurally different. In a study using a dataset very similar to ours, Chava & Jarrow (2004) find that predictive accuracy is higher when financial firms are omitted from the sample.

23 Frailty specifications have received some popularity for modeling the dependence of default events. In these models, $V_{it} = V_t$, so that only the heterogeneity across periods is addressed. In Duffie et al. (2009) and Koopman et al. (2008), $V_t$ follows an autoregressive process and can be interpreted as an unobservable common risk factor.

24 In particular, the ranking of the firms according to their default risk changed very little: Spearman’s rank correlation coefficient between the predictions from the log-logistic model with and without frailty was 0.9987.
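The default definition, the two-month reporting lag and the exclusion of financial firms can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the dictionary-based data layout and the toy figures are assumptions made purely for illustration. Only the rating codes D and SD, the SIC range 6000-6799 and the two-month lag are taken from the text.

```python
def add_months(year, month, k):
    """Return (year, month) shifted forward by k months."""
    idx = year * 12 + (month - 1) + k
    return idx // 12, idx % 12 + 1


def is_financial(sic):
    """SIC codes 6000-6799 mark financial firms, which are excluded."""
    return 6000 <= sic <= 6799


def in_default(ratings):
    """ratings: dict mapping rating type -> rating string or None
    (hypothetical layout). True if any type carries D or SD, False if
    at least one non-default rating exists, None if nothing is rated."""
    observed = [r for r in ratings.values() if r is not None]
    if not observed:
        return None
    return any(r in ("D", "SD") for r in observed)


def monthly_from_quarterly(statements, months, lag=2):
    """statements: dict mapping (year, month) of a fiscal period end
    to a balance sheet value. A value becomes usable `lag` months after
    the period end and is held constant until the next one is usable."""
    available = sorted(
        (add_months(y, m, lag), v) for (y, m), v in statements.items()
    )
    panel = {}
    for ym in months:
        usable = [v for avail, v in available if avail <= ym]
        panel[ym] = usable[-1] if usable else None
    return panel


# Toy example (invented numbers): fiscal quarters ending March and June 2000.
toy_statements = {(2000, 3): 100.0, (2000, 6): 110.0}
toy_months = [(2000, m) for m in range(4, 10)]
toy_panel = monthly_from_quarterly(toy_statements, toy_months)
```

In the toy panel, the March statement first becomes usable in May and is carried forward until the June statement becomes usable in August.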

In the data preparation process, we had to deal with both missing data and outliers. With respect to missing data, we imputed missing values for some variables based on regressions on their leads and lags.26 The main criterion was that the goodness-of-fit of such regressions be very high. For instance, total assets can be predicted very accurately from past and future values, whereas returns are known to be hard to predict. Consequently, we used imputations only for “stable” variables like total assets and did not impute any values for variables like returns or net income.27 Using such imputations will usually result in more efficient estimation. However, standard errors can be expected to be too low due to the reduced variability of imputed values (Harrell, 2001, Ch. 3.6). Since the share of missing values is rather low (no variable of our final model is missing in more than 10% of all cases) and since our focus is on prediction rather than on inference, efficiency gains should be more important.

25 If a firm defaults on all its securities, it receives a D (Default) rating, while it is rated SD (Selective Default) if the default event applies only to selected securities.

26 This method is often called single conditional mean imputation (Harrell, 2001, Ch. 3).

27 Imputations were used for the variables total assets, cash and short-term investments, market value, interest expenses, retained earnings and total liabilities. Cases in which variables were still missing had to be dropped from the subsequent analysis.
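As an illustration of single conditional mean imputation from leads and lags, the following Python sketch fills an isolated gap in one firm's series. It rests on several assumptions that are not taken from the text: a single regressor (the average of the adjacent lag and lead) stands in for the unspecified regression, the threshold `min_r2 = 0.95` is a hypothetical stand-in for a very high goodness-of-fit, and all names are invented for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0:
        return my, 0.0, 0.0  # regressor has no variation: no usable fit
    b = sxy / sxx
    a = my - b * mx
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return a, b, r2


def impute_from_neighbours(series, min_r2=0.95):
    """Fill isolated gaps (None) with the fitted conditional mean given
    the average of the lag and the lead. Nothing is imputed when the
    regression fits worse than min_r2 (assumed threshold)."""
    # Complete (lag, current, lead) triples serve as the estimation sample.
    triples = [
        (series[t - 1], series[t], series[t + 1])
        for t in range(1, len(series) - 1)
        if None not in (series[t - 1], series[t], series[t + 1])
    ]
    if len(triples) < 3:
        return list(series)  # too little data to judge the fit
    xs = [(lag + lead) / 2 for lag, _, lead in triples]
    ys = [cur for _, cur, _ in triples]
    a, b, r2 = fit_line(xs, ys)
    if r2 < min_r2:
        return list(series)  # unstable variable: leave gaps untouched
    filled = list(series)
    for t in range(1, len(series) - 1):
        if (series[t] is None and series[t - 1] is not None
                and series[t + 1] is not None):
            filled[t] = a + b * (series[t - 1] + series[t + 1]) / 2
    return filled
```

The goodness-of-fit gate mirrors the distinction drawn in the text: a smooth series such as total assets passes it, whereas a noisy series such as returns would typically fail it and remain unimputed.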

To eliminate the effect of outliers, we winsorized all variables at the 5th and 95th percentiles. An inspection of the data showed that implausible values (“wrong signs”) occasionally occur, pointing to a need for winsorization. We further fitted our models to the data before and after winsorization and observed a markedly better goodness-of-fit for the winsorized dataset. By winsorizing the data, we follow the related literature, where this procedure is very common. The final dataset consists of 339 222 firm-months from 3575 firms in the period from December 1980 until March 2010. We observe 498 different default events, but note that our definition of $Y_{it}$ leads to 18 914 partially overlapping lifetimes in our sample that end with a default event.
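Winsorizing at the 5th and 95th percentiles replaces every value below the 5th percentile by that percentile and every value above the 95th percentile by the 95th percentile. The sketch below shows this in plain Python. The linear-interpolation percentile rule and the function names are assumptions, since the text does not state which percentile definition was used.

```python
def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list.
    The interpolation rule is an assumption, not taken from the text."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def winsorize(values, lower=5, upper=95):
    """Clamp each value to the [lower, upper] percentile range."""
    s = sorted(values)
    lo_cut, hi_cut = percentile(s, lower), percentile(s, upper)
    return [min(max(v, lo_cut), hi_cut) for v in values]
```

Unlike trimming, this keeps every observation in the sample, so the number of firm-months is unchanged; only the extreme values are pulled in to the cut-offs.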

For the selection of our covariates, we used the experience from studies based on similar datasets (Shumway, 2001; Chava & Jarrow, 2004; Duffie et al., 2007; Campbell et al., 2008; Löffler & Maurer, 2011) to choose candidate variables. Table 3.1 lists the covariates considered together with descriptive statistics. The final specification of our models was derived by a backward selection approach that entailed the sequential reduction of the model containing all candidate variables.28 As the main criteria in the model selection process, we used the Wald statistics and the associated p values of the covariates, since we have to be careful with likelihood ratio tests and information criteria in a pseudo-likelihood setting. The liquidity variable (CATA) and retained earnings (RETA) were found to be insignificant (with p values larger than 0.5), so we did not include them in the final model although the signs of the coefficients were as theoretically expected. Interest coverage (NII) was found to be significant but is strongly correlated with profitability (NITA). Due to this finding and the fact that the share of missing values was considerably higher for NII (16.6% vs. 3%), we dropped NII. For the covariates of the final model, all correlations are below 0.5 in absolute value (see Table 3.2), so that multicollinearity should not pose a problem. Further, we looked for possible non-monotone effects of the variables on the hazard rate by grouping the covariates into quartiles and including the corresponding dummy variables in our model. We found strongly non-monotone effects for growth of total assets: both high and low (highly negative) growth rates are associated with higher default risk. Therefore, our final model contains a dummy variable that equals one if annual growth of total assets is in the upper or lower quartile and zero otherwise. The other covariates are quite standard and are used in this way or very similarly in the aforementioned studies.

28 When deciding between forward and backward selection, one must weigh potential biases arising from starting with a very simple model against potential data mining problems when a very large model is the starting point (Greene, 2008, Ch. 7.2.4). Since the set of candidate variables is moderate in our analysis, we decided to use backward selection.

Table 3.1: Summary statistics for covariates

Name  Description                                        Mean     St.dev.  Min      Max

Selected for final model
NITA  Net income over previous year / Total assets       .007     .020     -.155    .079
TLTA  Total liabilities / Total assets                   .636     .168     .115     1
GRO   Dummy for extreme growth of total assets           .5       .5       0        1
RET   Excess one-year log stock return over S&P 500      -.029    .367     -1.317   1.220
VOLA  St. dev. of monthly log returns in previous year   .110     .063     .039     .298
SIZE  Log(market value / S&P 500 total market value)     -8.99    1.72     -13.27   -6.34

Not selected for final model
CATA  Cash and short-term investments / Total assets     .071     .085     .001     .344
RETA  Retained earnings / Total assets                   .134     .255     -.582    .537
NII   Net income / Interest expenses over previous year  2.797    5.513    -3.993   25.295

Table 3.2: Correlations of covariates

      NITA    TLTA    GRO     RET     VOLA    SIZE
NITA  1.000
TLTA  -0.343  1.000
GRO   -0.221  0.107   1.000
RET   0.255   -0.103  -0.061  1.000
VOLA  -0.438  0.201   0.261   -0.266  1.000
SIZE  0.380   -0.278  -0.171  0.280   -0.431  1.000
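Two steps of the specification search lend themselves to a short sketch: backward elimination driven by Wald p values, and the construction of the growth dummy from the outer quartiles. The Python code below is a sketch under stated assumptions, not the authors' procedure. `fit_pvalues` is a hypothetical callback that refits the model on the given covariates and returns their Wald p values, the stopping threshold of 0.05 is an assumed value that the text does not report, and the interpolated quartile rule is likewise an assumption. Judgment-based exclusions, such as that of NII, are not covered.

```python
def backward_select(candidates, fit_pvalues, threshold=0.05):
    """Drop the least significant covariate until all p values are
    at or below the threshold.

    fit_pvalues (hypothetical): callable that takes a list of covariate
    names, refits the model and returns {name: Wald p value}.
    threshold: assumed cut-off; the text does not report one.
    """
    selected = list(candidates)
    while selected:
        pvals = fit_pvalues(selected)
        worst = max(selected, key=lambda name: pvals[name])
        if pvals[worst] <= threshold:
            break  # every remaining covariate is significant
        selected.remove(worst)
    return selected


def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def extreme_growth_dummy(growth_rates):
    """1 if annual asset growth lies in the lower or upper quartile, else 0."""
    s = sorted(growth_rates)
    q1, q3 = percentile(s, 25), percentile(s, 75)
    return [1 if (g < q1 or g > q3) else 0 for g in growth_rates]
```

By construction, roughly half of the observations receive a one, which is consistent with the mean of .5 reported for GRO in Table 3.1.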

In document Multi-Period Credit Default Prediction - A Survival Analysis Approach (Page 66-70)