
2. An Introduction to Boosting

2.6. Fitting GAMLSS with Boosting

As a short outlook that goes beyond the actual scope of this thesis, we briefly discuss generalized additive models for location, scale and shape (GAMLSS) in the context of boosting. Boosted GAMLSS can be seen as an extension of the boosting algorithm itself, which is used as a basis to fit models for multiple components. Generalized additive models for location, scale and shape were first introduced by Rigby and Stasinopoulos (2001, 2005). This class of models represents a flexible extension of GAMs (Hastie and Tibshirani 1986, 1990). GAMs are typically used to model the mean of the conditional distribution of an outcome. This conditional distribution is assumed to belong to an exponential family. Hence, the scale and shape are implicitly defined by the mean in combination with the distribution. However, if heteroscedastic or skewed distributions are to be modeled, this approach is not suitable. In contrast, GAMLSS models regress multiple parameters of the conditional distribution, say θ1, . . . , θK, where usually K ≤ 4, on a set of covariates. These parameters might include the location (e.g., the mean), the scale (e.g., the standard deviation or the variance) and additional shape parameters (e.g., the skewness or the degrees of freedom). By using GAMLSS models, one overcomes some of the shortcomings and limitations of classical GAMs. The price to pay is the additional complexity of the model and possibly a higher degree of instability compared to simpler generalized linear or generalized additive models.
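The model structure described above can be sketched in symbols. The notation below follows the usual GAMLSS presentation; the distribution symbol D, the link functions g_k, the additive predictors η_k, the intercepts β_0k and the covariate effects f_jk are not defined in this section and are introduced here as assumptions for illustration only.

```latex
\begin{align*}
y_i \mid x_i &\sim \mathcal{D}\bigl(\theta_1(x_i), \dots, \theta_K(x_i)\bigr), \\
g_k(\theta_k) &= \eta_k = \beta_{0k} + \sum_{j=1}^{p} f_{jk}(x_j),
\qquad k = 1, \dots, K.
\end{align*}
```

For a Gaussian outcome with K = 2, for example, θ1 = μ could be modeled with the identity link and θ2 = σ with a log link, so that both the mean and the standard deviation depend on covariates.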


A drawback of the standard algorithms that are used to estimate GAMLSS models (Rigby and Stasinopoulos 2005, App. B) is that they cannot be used to fit models to high-dimensional data with fewer observations than variables (n < p). If many possible predictors are available, variable selection becomes a crucial part of modeling the data. This is especially important for GAMLSS, as we have multiple components, which are all regressed on (subsets of) these predictors. Rigby and Stasinopoulos (2005) propose to use the generalized AIC (GAIC), with AIC and BIC as special cases, for model selection. The problem with this approach is that it tends to be highly unstable. Furthermore, in data sets with many predictors, the only feasible approaches that make use of the (G)AIC are stepwise approaches. This increases the instability even further, and an ‘optimal’ or ‘near-optimal’ model might be completely missed in this case. Another problem associated with the AIC is that it tends to overfit the data, i.e., the AIC tends to favor models that are too large (e.g., Ripley 2004). In penalized fitting approaches, such as generalized additive models and their extensions, further conceptual problems arise, as, strictly speaking, the AIC is only valid for maximum likelihood estimation (Ripley 2004).

Our idea to overcome the problem of model selection in this complex and (potentially) high-dimensional context makes use of boosting with its intrinsic variable selection feature. The approach is based on a recently published boosting algorithm for multi-dimensional prediction functions (Schmid, Potapov, Pfahlberg, and Hothorn 2010b), which was extended to the fitting of GAMLSS models (Mayr et al. 2011b). In essence, an additional inner loop is processed within each boosting step. In each boosting iteration, we cycle consecutively through all distribution parameters θ1, . . . , θK of the GAMLSS model. For each parameter θk, the negative gradient is computed with respect to this parameter (i.e., the partial derivative of the loss function with respect to θk), and the current values of the other distribution parameters are plugged in as offset values. Subsequently, the negative gradient vector is fitted by (component-wise) base-learners, and the current estimate of θk is updated with the best-fitting base-learner. Further details about the algorithm are provided in Mayr et al. (2011b).
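The cyclic scheme described above can be illustrated with a minimal Python sketch. It is not the implementation of Mayr et al. (2011b) but a simplified illustration under stated assumptions: a Gaussian outcome with θ1 = μ (identity link) and θ2 = σ (log link), the negative log-likelihood log σ + (y − μ)²/(2σ²) as loss, gradients taken with respect to the additive predictors, simple linear base-learners with intercept, and a fixed step length nu. The function name `boost_gaussian_lss` and all argument names are hypothetical.

```python
import numpy as np


def boost_gaussian_lss(X, y, mstop=200, nu=0.1):
    """Cyclic component-wise gradient boosting for y ~ N(mu, sigma^2).

    Assumptions of this sketch: theta_1 = mu with identity link,
    theta_2 = sigma with log link, and one linear base-learner
    (with intercept) per covariate. Columns of X must not be constant.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    x_mean = X.mean(axis=0)
    Xc = X - x_mean                  # centred covariates
    ss = np.sum(Xc ** 2, axis=0)     # column sums of squares

    # Offsets: constant starting values for both additive predictors.
    intercept = np.array([y.mean(), np.log(y.std())])
    coef = np.zeros((2, p))
    eta = [np.full(n, intercept[0]), np.full(n, intercept[1])]

    for _ in range(mstop):           # outer loop: boosting iterations
        for k in range(2):           # inner loop: cycle through theta_k
            # Current values of both parameters enter the gradient.
            mu, sigma = eta[0], np.exp(eta[1])
            if k == 0:
                # negative gradient of the loss w.r.t. eta_mu
                u = (y - mu) / sigma ** 2
            else:
                # negative gradient of the loss w.r.t. eta_sigma = log(sigma)
                u = (y - mu) ** 2 / sigma ** 2 - 1.0

            # Least-squares fit of u on each centred covariate:
            # u ~ a + b_j * x_j, with a = mean(u) for every j.
            a = u.mean()
            b = Xc.T @ u / ss
            # Residual sum of squares is minimal where b_j^2 * ss_j is maximal.
            j = int(np.argmax(b ** 2 * ss))

            # Update only the selected base-learner, shrunken by nu.
            intercept[k] += nu * a
            coef[k, j] += nu * b[j]
            eta[k] = eta[k] + nu * (a + b[j] * Xc[:, j])

    return {"intercept": intercept, "coef": coef, "x_mean": x_mean}
```

Row k of the returned `coef` array holds the coefficients for θk on the centred covariates; covariates that were never selected keep a coefficient of exactly zero, which is the intrinsic variable selection mentioned above. Because the gradient for μ is scaled by 1/σ², very small fitted values of σ can lead to large update steps, so this sketch should be read as an illustration rather than a robust fitting routine.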

In the simplest version, one uses one common mstop for all distribution parameters. In many cases this might be sufficient. However, in other cases the model for the mean of the distribution θ1 might, for example, be more complex than the model for the standard deviation θ2. In this case, using one single stopping iteration might result in overfitting of the standard deviation, in ‘underfitting’ of the mean, or possibly in a mixture of both. Hence, it might be more sensible to allow different values of mstop,k, k = 1, . . . , K, for the different components. In the algorithm this is achieved by skipping the update of parameter θk after mstop,k steps. Only parameters that have not reached their maximum number of iterations are updated further. An issue that arises in the case of ‘multi-dimensional stopping’ is that the stopping parameters have to be optimized simultaneously, i.e., by taking the stopping iterations of the other parameters into account. In line with the classical boosting approach, cross-validation techniques are advisable; however, multi-dimensional cross-validation might become problematic if many distribution parameters are estimated. Multi-dimensional stopping is of special interest in high-dimensional settings where variable selection is desired. More research on feasible strategies to find the optimal (multi-dimensional) stopping iterations is required. One solution to this problem might be given by the ideas on AIC-based pre-stopping as discussed in Section 2.4.3. However, in this case, approximate degrees of freedom would be required. On the other hand, if prediction is of primary interest and variable selection is of minor importance, multi-dimensional stopping seems less crucial. Using one single mstop might often be sufficient (as suggested by the simulation studies in Mayr et al. 2011b) due to the slow overfitting behavior of boosting.
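The skipping rule and the cost of tuning several stopping iterations at once can be made concrete with a small Python sketch. Both helper functions, `active_parameters` and `stopping_grid`, are hypothetical names introduced for illustration; the sketch assumes a plain grid search over a shared list of candidate values, which is only one of several conceivable tuning strategies.

```python
from itertools import product


def active_parameters(m, mstop):
    """Return the (0-based) indices k of parameters still updated in iteration m.

    m is the 1-based boosting iteration; mstop is a sequence
    (mstop_1, ..., mstop_K) of per-parameter stopping iterations.
    The update of theta_k is skipped once m exceeds mstop_k.
    """
    return [k for k, m_k in enumerate(mstop) if m <= m_k]


def stopping_grid(candidates, K):
    """All K-tuples of stopping iterations built from a shared candidate list."""
    return list(product(candidates, repeat=K))
```

With mstop = (100, 30), for instance, `active_parameters(50, (100, 30))` contains only the index of the first parameter, because the second one stopped after 30 iterations. The grid returned by `stopping_grid` has len(candidates)^K entries, so ten candidate values already yield 10,000 combinations for K = 4, each of which would have to be evaluated by cross-validation; this is why multi-dimensional cross-validation becomes demanding as the number of distribution parameters grows.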