3.3 Distribution-free estimation
3.3.2 Statistical learning and machine learning approaches
In recent years, various computer-intensive estimation procedures originating from machine learning and statistical learning algorithms have been used for quantile regression. These approaches are completely distribution-free but do not always address a particular covariate structure such as the one considered by the STAQ predictor in (3.2) on page 35.
In the following, we briefly describe three approaches that have been suggested for modelling the quantile function of a response variable depending on a (potentially large) number of covariates, namely quantile regression forests, quantile regression neural networks, and kernel-based quantile regression using support vector machines. Since a detailed description of these highly complex and very different concepts would go beyond the scope of this thesis, we only touch on them here and refer to the corresponding literature for further reading.
Note that boosting, which will be treated in detail in Chapter 4, also belongs to the present class of distribution-free, computer-intensive estimation approaches and even allows for a structured additive predictor to be modelled. Boosting can be rated as a statistical learning algorithm since it incorporates two competing goals of learning from data: prediction (which is the main goal in the machine learning community) and interpretation (which is an additional main goal in the statistical community).
Quantile regression forests
Chaudhuri and Loh (2002) made one of the first attempts to use tree-based methods for estimating conditional quantile functions. A few years later, quantile regression forests were introduced by Meinshausen (2006) as an extension of random forests (Breiman, 2001) to quantile regression. The aim of quantile regression forests is to estimate the cumulative distribution function (cdf) of a response variable conditional on covariates without imposing any structure on their relationship. To achieve this aim, an ensemble of regression trees is grown similarly to random forests, as follows. First, a large number of bootstrap samples of the training data is drawn. Then, for every single bootstrap sample a random subset of the covariates is drawn and a regression tree is grown.
selected based on a test dataset. The conditional cdf for a new observation (vector) $X = x$ is estimated by

\[
\hat{F}(y \mid X = x) = \sum_{i=1}^{n} w_i(x) \, I(y_i \le y),
\]

where $y_i$ denotes the response, $I(\cdot)$ an indicator function, and $w_i(x)$ the weights of the original observations $i = 1, \ldots, n$, which are calculated by dropping $x$ down all trees. More specifically, for each single tree the observations which share the leaf with the new $x$ get non-zero, uniform weights (equal for all observations in that leaf). The resulting weights $w_i(x)$ are the averages over these observation-specific weights from all trees. The quantile function is finally obtained by inverting the cdf. The main difference between quantile regression forests and the original random forests algorithm is that one takes note of all observations in each leaf and not only of their mean, which allows the whole conditional cdf to be estimated for a new observation (vector) $X = x$ as described above.
In addition, Meinshausen (2006) gives a proof for the consistency of the cdf estimated in this manner.
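The weighting scheme and the inversion of the estimated cdf can be illustrated with a minimal Python sketch. It does not grow any trees: the leaf memberships in `train_leaves` and `x_leaves` are hypothetical toy values specified by hand, and all function names are chosen here for illustration only. It is also assumed that every leaf reached by the new observation contains at least one training observation.

```python
def forest_weights(train_leaves, x_leaves):
    """Average per-tree leaf weights over all trees.

    train_leaves[t][i]: leaf id of training observation i in tree t (toy input).
    x_leaves[t]: leaf id that the new observation x falls into in tree t.
    """
    n = len(train_leaves[0])
    weights = [0.0] * n
    for leaves, x_leaf in zip(train_leaves, x_leaves):
        members = [i for i, leaf in enumerate(leaves) if leaf == x_leaf]
        for i in members:
            # Observations sharing the leaf with x get equal weights.
            weights[i] += 1.0 / len(members)
    n_trees = len(train_leaves)
    return [w / n_trees for w in weights]


def weighted_cdf(y_train, weights, y):
    """Estimated conditional cdf: sum of the weights with y_i <= y."""
    return sum(w for yi, w in zip(y_train, weights) if yi <= y)


def weighted_quantile(y_train, weights, tau):
    """Invert the cdf: smallest observed y_i whose cdf value reaches tau."""
    cum = 0.0
    for yi, w in sorted(zip(y_train, weights)):
        cum += w
        if cum >= tau - 1e-12:  # small tolerance for rounding error
            return yi
    return max(y_train)


# Toy data: five training responses and two hand-specified trees.
y_train = [1.0, 2.0, 3.0, 4.0, 5.0]
train_leaves = [
    [0, 0, 1, 1, 1],  # leaf ids of the training observations in tree 1
    [0, 1, 1, 2, 2],  # leaf ids of the training observations in tree 2
]
x_leaves = [1, 1]     # leaves reached by the new observation x

weights = forest_weights(train_leaves, x_leaves)
median = weighted_quantile(y_train, weights, 0.5)
lower = weighted_quantile(y_train, weights, 0.1)
upper = weighted_quantile(y_train, weights, 0.9)
```

Because all quantiles are read off the same estimated cdf, the resulting quantiles are monotone in the quantile parameter by construction.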
It is difficult to rate quantile regression forests regarding our criteria for model assessment since
they do not address a flexible predictor and can rather be seen as black box estimators.
Therefore, it is not possible to explicitly quantify the relationships between covariates and
response and to obtain inference results for single estimators. In random forests, variable selection is possible by applying variable importance measures, see for example Strobl et al. (2008), and it could be a matter of future research how these measures can be adapted for quantile regression forests. The main advantages of quantile regression forests are their applicability to high-dimensional data and their implicit prevention of quantile crossing by estimating the full conditional cdf in one step. Since random forests typically perform well in prediction settings, Meinshausen (2006) suggested applying quantile regression forests for the construction of prediction intervals for new observations, as was already sketched in alternative 1 for the usage of quantile regression in Section 1.2. Software for fitting quantile regression forests is available in the R package quantregForest (Meinshausen, 2012).
Quantile regression neural networks
Taylor (2000) introduced quantile regression neural networks as another computer-intensive algorithm which is well suited for prediction and forecasting. The standard approach of artificial neural networks provides a general concept for fitting nonlinear high-dimensional regression models based on the minimization of a loss criterion. Quantile regression is performed when the check function is inserted as a special loss function in the standard algorithm.
Since neural networks rely on gradient-based nonlinear optimization, they theoretically require a loss function which is differentiable everywhere. Because of its kink at zero, the check function does not fulfil this requirement, and it is not clear whether convergence problems might occur when applying the standard optimization algorithm for neural networks (Taylor, 2000). As a solution, Cannon (2011) replaced the check function by a differentiable loss function (an approximation which had first been suggested by Chen, 2007) and adapted the quantile regression neural network algorithm of Taylor (2000) accordingly.
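To make the differentiability issue concrete, the following Python sketch compares the check function with one possible differentiable surrogate. The surrogate shown here is a Huber-type smoothing with a hypothetical smoothing parameter `eps`; it is an assumption made for illustration and not necessarily the exact form used in the cited references. All function names are likewise chosen here.

```python
def check_loss(u, tau):
    """Check (pinball) function: tau * u for u >= 0, (tau - 1) * u for u < 0."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def huber(u, eps):
    """Huber function: quadratic on [-eps, eps], linear outside."""
    a = abs(u)
    return u * u / (2.0 * eps) if a <= eps else a - eps / 2.0


def smooth_check_loss(u, tau, eps):
    """Assumed surrogate: |u| in the check function is replaced by huber(u)."""
    return (tau if u >= 0 else 1.0 - tau) * huber(u, eps)


def smooth_check_grad(u, tau, eps):
    """Derivative of the surrogate; it exists everywhere, including u = 0."""
    scale = tau if u >= 0 else 1.0 - tau
    if abs(u) <= eps:
        return scale * u / eps
    return scale * (1.0 if u > 0 else -1.0)


tau, eps = 0.9, 0.1
residuals = [-1.0, -0.05, 0.0, 0.05, 1.0]

# Gap between the check function and its smooth surrogate for each residual.
gaps = [check_loss(u, tau) - smooth_check_loss(u, tau, eps) for u in residuals]

# Gradient of the surrogate just left and right of the former kink.
grad_left = smooth_check_grad(-1e-9, tau, eps)
grad_right = smooth_check_grad(1e-9, tau, eps)
```

In this sketch the surrogate never exceeds the check function and differs from it by at most `eps / 2`, so a small `eps` keeps the approximation close while the gradient passes continuously through zero at the origin.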
Assessing quantile regression neural networks regarding our criteria is as difficult as assessing quantile regression forests since neural networks just provide black box estimators without giving
addressed nor can results on inference of single parameters and variable selection be obtained. The use of quantile regression neural networks makes sense when predictions or predictive densities are of interest, as demonstrated in the example of Cannon (2011), where daily precipitation amounts were forecast. However, there is a danger of quantile crossing. Concerning software, quantile regression neural networks are implemented in the R package qrnn (Cannon, 2011).
Kernel-based quantile regression
Another class of completely distribution-free estimation approaches for quantile regression originates from the powerful framework of support vector machines (SVMs). The generic structure of empirical SVMs is described in Christmann and Hable (2012): the aim is to minimize an empirical loss criterion based on a convex loss function between the response variable and an unspecified regression function of the covariates. This regression function is assumed to belong to a reproducing kernel Hilbert space (RKHS) and is penalized by a suitable RKHS norm penalty to avoid overfitting and ensure existence.
Takeuchi et al. (2006) directly started from the check function as loss function (which is called the pinball loss function in the machine learning community), whereas Christmann and Hable (2012) formulated the minimization problem of empirical SVMs in a general way and considered the check function as one special instance which leads to quantile regression.
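For orientation, the minimization problem described above can be written in a generic form. The notation is an assumption made here: $\tau$ denotes the quantile parameter, $\rho_\tau$ the check (pinball) function, $\mathcal{H}$ the RKHS, and $\lambda > 0$ the regularization parameter. The exact scaling of the penalty term differs between references.

```latex
\[
\hat{f}_{\tau}
  = \operatorname*{arg\,min}_{f \in \mathcal{H}}
    \; \frac{1}{n} \sum_{i=1}^{n} \rho_{\tau}\bigl(y_i - f(x_i)\bigr)
    + \lambda \, \lVert f \rVert_{\mathcal{H}}^{2},
\qquad
\rho_{\tau}(u) = u \, \bigl(\tau - I(u < 0)\bigr).
\]
```

Replacing $\rho_\tau$ by another convex loss function gives other instances of the general empirical SVM formulation.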
Regarding a flexible predictor, no structure is assumed for the covariate predictor and therefore
for the relationship between covariates and response in the general formulation of empirical SVMs. However, it is possible to impose a structure by choosing suitable kernel functions for different covariates. For example, Christmann and Hable (2012) considered an additive model with smooth nonlinear functions of continuous covariates. This model covers some components of the generic predictor in (3.2). A crucial tuning parameter of these algorithms is the regularization
parameter λ, which can for example be chosen by cross-validation. Results on estimator properties and inference have recently been obtained by Christmann and Hable (2012), who showed consistency of the SVM estimators. In addition, asymptotic confidence sets for the estimators, which can deliver pointwise asymptotic confidence intervals, were derived by Hable (2012). Quantile crossing can occur due to the separate regression fits for different quantile parameters.
With regard to software, kernel-based quantile regression with SVMs can be fitted by the function