3.3.1 Introduction
Conditional class probability estimation extends the problem of quantile
classification from estimating at a particular quantile q to estimating at all
quantiles q ∈ [0,1]. While machine learning methods have been adapted from
estimating at q = 1/2 in order to estimate at a particular q ∈ [0,1], estimating
at all quantiles is a significantly greater challenge.
Rather than proceeding from a single q to all q, as has been done in the machine
learning literature, the general approach in statistics has been to proceed in the
opposite direction. First, estimate the entire conditional class probability function.
Then, use this function to achieve classification at a particular quantile. For in-
stance, model-based approaches such as logistic regression give an estimate of the
conditional class probability function which can easily be transformed into an
arbitrary quantile classifier by thresholding the probability function. Such
approaches are indeed very successful, but only under very restrictive
conditions: they require knowledge of the functional form of the conditional
class probability function and enough data to estimate the parameters accurately.
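The thresholding step described above can be sketched in a few lines. This is only an illustration: the function name `quantile_classify`, the convention of assigning class 1 when the estimate strictly exceeds q, and the probability values are all assumptions made for the example, not part of the original text.

```python
def quantile_classify(prob_estimates, q):
    """Return 1 where the estimated P(Y=1 | x) exceeds q, else 0.

    Assumed convention: class 1 is assigned when the estimate is
    strictly greater than the quantile q.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    return [1 if p > q else 0 for p in prob_estimates]


# Hypothetical probability estimates for four points.
p_hat = [0.10, 0.35, 0.55, 0.90]

labels_median = quantile_classify(p_hat, 0.5)  # classification at q = 1/2
labels_q30 = quantile_classify(p_hat, 0.3)     # classification at q = 0.3
```

A single set of probability estimates thus yields a classifier for every q, which is why the quality of the estimates matters across the whole range and not only near one threshold.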
When the functional form is unknown or data is scarce, conditional class
probability estimation is extremely difficult. Whereas quantile classifiers only
have to be accurate at one particular quantile, probability estimators must be
accurate at every quantile. That is, probability estimators not only have to
perform the task of a good quantile classifier, they must perform the tasks of
all quantile classifiers and perform them well. In order to classify, both
methods typically work by thresholding a score function (usually, arbitrarily,
at zero). Quantile classifiers, by focusing on one particular quantile, thus
only need the sign of the score function to be accurate in order to provide good
performance on test sets; a conditional probability estimator, on the contrary,
must be accurate at all thresholds, and therefore the magnitude of the score
function is also critical, not just its sign. Hence, probability estimators face
a much more difficult task.
3.3.2 Machine Learning Methods and Probability Estimation
Many of the machine learning methods discussed in Chapter 2 can be used to form
conditional class probability estimates as well as classifications. We briefly review
some of the known results pertaining to these.
Not surprisingly, individual CART trees tend to give fairly poor conditional
class probability estimates: a tree assigns the same conditional class
probabilities to all points that fall within a given terminal node, thus
ignoring any heterogeneity among them. This unrealistic property is not shared
by methods that combine trees, such as boosting and Random forests, and
therefore those methods hold greater hope for providing successful probability
estimates.
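The piecewise-constant behaviour of a single tree can be made concrete with a toy one-split tree (a "stump"). The sketch below is not CART itself: the split point is fixed by hand, the data are invented, and the helper names `fit_stump` and `predict_proba` are hypothetical. It only illustrates that every point in a terminal node receives the same estimate.

```python
def fit_stump(xs, ys, split):
    """Fit a one-split tree; each leaf stores its fraction of class-1 labels.

    Assumes the split leaves at least one training point in each leaf.
    """
    left = [y for x, y in zip(xs, ys) if x <= split]
    right = [y for x, y in zip(xs, ys) if x > split]
    return {
        "split": split,
        "left": sum(left) / len(left),
        "right": sum(right) / len(right),
    }


def predict_proba(stump, x):
    """Return the leaf proportion for the terminal node containing x."""
    return stump["left"] if x <= stump["split"] else stump["right"]


# Invented one-dimensional training data with 0/1 labels.
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
stump = fit_stump(xs, ys, split=0.5)

p_near = predict_proba(stump, 0.51)  # just past the split
p_far = predict_proba(stump, 0.99)   # far from the split, same leaf
```

Here `p_near` and `p_far` are identical even though one point lies next to the split and the other far from it, which is the heterogeneity a single tree ignores.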
The forward, stage-wise additive view of boosting presented in Chapter 2 sug-
gests that AdaBoost can be transformed into an estimator of the conditional class
probability distribution via a link function (Friedman et al., 2000a). It also led to the development of other algorithms like LogitBoost which use the same forward,
stage-wise additive optimization but for other loss functions; these other algorithms
are therefore also equipped with link functions to obtain conditional class probabil-
ity estimates (Friedman et al., 2000a).
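For reference, the link function in question can be written out as follows. This is our reading of the result in Friedman et al. (2000a), stated under the assumptions that the classes are coded as Y ∈ {-1, +1} and that F(x) denotes the boosting score function; neither notation is fixed in this section.

```latex
% Population minimizer of the exponential loss E[exp(-Y F(x))]:
F(x) = \frac{1}{2} \log \frac{P(Y = 1 \mid x)}{1 - P(Y = 1 \mid x)}

% Inverting this relation gives the link from score to probability estimate:
\hat{P}(Y = 1 \mid x) = \frac{1}{1 + e^{-2F(x)}}
```

Under this link, a score of zero corresponds to a probability of 1/2, so thresholding the score at zero is classification at the median.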
Logistic regression also uses a link function and is known to provide good
probability estimates (and therefore good classifications) when the functional
form is known. Since AdaBoost provides good classifications even when the
functional form is unknown, it was hoped that the link function of Friedman et
al. (2000a) would transform it into a good probability estimator in such cases.
Unfortunately, it has been shown by several studies that AdaBoost and Logit-
Boost provide poor estimates of the conditional class probability distribution (Mease
et al., 2007; Mease and Wyner, 2008; McShane, 2007). Typically, when AdaBoost
ing probability estimates via the link function have diverged to near zero or one.
Furthermore, the same is true of LogitBoost despite the fact that estimation of class
probabilities via log-likelihood loss provided the motivation for this algorithm.
One of the reasons boosting is so successful at classification is that the
"score function" (i.e., the weighted sum of base learners) tends to be very
large in absolute value. This leads to overfit probability estimates that
diverge to zero or one (and which are therefore quite poor), whereas it does not
lead to overfit classifications (because, for classification, only the sign of
the score function, not its absolute value, matters). Since providing
probability estimates requires being a good quantile classifier for all
quantiles (i.e., the absolute value of the score function does matter), AdaBoost
tends to fail at probability estimation.
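The contrast between sign and magnitude can be illustrated numerically. The sketch below assumes the link 1 / (1 + exp(-2F)) and uses invented scores; multiplying the scores by a growing factor is only a crude stand-in for the growth of the score function over boosting iterations, not a simulation of AdaBoost.

```python
import math


def link(score):
    """Map a score F(x) to a probability estimate via 1 / (1 + exp(-2F))."""
    return 1.0 / (1.0 + math.exp(-2.0 * score))


# Invented scores for four points; two negative, two positive.
scores = [-0.4, -0.1, 0.2, 0.5]

for scale in (1, 5, 25):
    scaled = [scale * s for s in scores]
    # Classification at zero depends only on the sign of the score.
    signs = [1 if s > 0 else 0 for s in scaled]
    # Probability estimates depend on the magnitude as well.
    probs = [round(link(s), 4) for s in scaled]
    print(scale, signs, probs)
```

The classifications are the same at every scale, while the probability estimates are pushed toward zero or one as the scale grows.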
The apparent failure of boosting to estimate probabilities and the theoretical
view of it as a forward, stagewise additive model have led to a number of refinements
of the algorithm. One such refinement is LogitBoost (Friedman et al., 2000a),
but there are also suggestions for early stopping (Dettling and Buhlmann, 2003),
shrinkage (Friedman et al., 2000b), regularization methods (Bickel et al., 2006;
Jiang, 2004; Lugosi and Vayatis, 2004), and using shallower trees or weaker base
learners (Friedman et al., 2000a; Hastie et al., 2001). But, given that boosting
overfits probability estimates but not median estimates, it is questionable
whether boosting's success is due to similarity with logistic regression as suggested in Fried-
be misguided (Mease and Wyner, 2008).
If one wants to retain the forward, stagewise additive logistic regression
viewpoint, it thus seems one must temper it by noting that overfit probability
estimates may be required to attain optimal classifications. Furthermore, in the
presence of unequal misclassification costs (or imbalanced base rates or
classification at quantiles different from 1/2), this view may lead to poor
performance: one must hope to stop the boosting algorithm early enough that the
score function has not yet diverged (which would produce bad probability
estimates) but late enough that the algorithm has sufficiently learned the data
structure (so that it produces good class estimates).
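To see why classification at quantiles other than 1/2 is sensitive to the scale of the score, consider the cutoff that a given q implies on the score itself. The sketch assumes the link 1 / (1 + exp(-2F)), so that the estimate exceeds q exactly when F exceeds (1/2) log(q / (1 - q)); the function names and scores are hypothetical.

```python
import math


def score_threshold(q):
    """Score cutoff t such that 1 / (1 + exp(-2F)) > q exactly when F > t."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    return 0.5 * math.log(q / (1.0 - q))


def classify_at_quantile(scores, q):
    """Assign class 1 where the score exceeds the cutoff implied by q."""
    t = score_threshold(q)
    return [1 if s > t else 0 for s in scores]


# Invented scores for four points.
scores = [-1.2, -0.3, 0.1, 0.8]

at_median = classify_at_quantile(scores, 0.5)  # cutoff is zero
at_q80 = classify_at_quantile(scores, 0.8)     # cutoff is positive
```

At q = 1/2 the cutoff is zero and only the sign matters. At any other q the cutoff is a fixed nonzero number, so once the scores have grown far beyond it in absolute value, almost every point falls on the same side as it would at zero, and the classifier at q becomes nearly indistinguishable from the median classifier.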
A final point is that bagging, and in particular Random forests, tend to produce
much more reasonable and sometimes even quite good probability estimates as
is shown by Bostrom (2007) and Bostrom (2008) (particularly when calibrated)
and by our own results presented in Chapter 6. It is thought that, since these
techniques do not recursively re-weight the individual datapoints but instead rely
on the bootstrap, they avoid the overfitting tendency of boosting methods. Much
exploration is still needed, however.