Conditional Class Probability Estimation - Machine Learning Methods with Time Series Dependence

3.3.1 Introduction

Conditional class probability estimation extends the problem of quantile classification from estimating at a particular quantile q to estimating at all quantiles q ∈ [0,1]. While machine learning methods have been adapted from estimating at q = 1/2 in order to estimate at a particular q ∈ [0,1], estimating at all quantiles is a significantly greater challenge.

Rather than proceeding from a single q to all q, as has been done in the machine learning literature, the general approach in statistics has been to proceed in the opposite direction. First, estimate the entire conditional class probability function. Then, use this function to achieve classification at a particular quantile. For instance, model-based approaches such as logistic regression give an estimate of the conditional class probability function which can easily be transformed into an arbitrary quantile classifier by thresholding the probability function. Such approaches are indeed very successful but only under very restrictive conditions: they require knowledge of the functional form and enough data to estimate the parameters accurately.
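
The thresholding step can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `quantile_classify` and the probability values are hypothetical, and the sketch assumes the convention that a point is assigned to class 1 when its estimated conditional class probability exceeds q.

```python
def quantile_classify(probabilities, q):
    """Turn estimated P(Y=1|x) values into 0/1 labels by thresholding at q."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    return [1 if p > q else 0 for p in probabilities]


# Hypothetical probability estimates, e.g. from a fitted logistic regression.
p_hat = [0.05, 0.30, 0.55, 0.80, 0.95]

# One estimated probability function yields a classifier for every quantile.
labels_median = quantile_classify(p_hat, 0.5)
labels_q09 = quantile_classify(p_hat, 0.9)
```

Here the same estimates serve both quantiles, which is the sense in which a good probability estimator gives every quantile classifier at once.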

When the functional form is unknown or data is scarce, conditional class probability estimation is extremely difficult. Whereas quantile classifiers only have to be accurate at one particular quantile, probability estimators must be accurate at every quantile. That is, probability estimators not only have to perform the task of a good quantile classifier, they must perform the tasks of all quantile classifiers and perform them well. In order to classify, both methods typically work by utilizing a score function and thresholding it (usually, arbitrarily at zero). Quantile classifiers, by focusing on one particular quantile, thus only need to be accurate up to the sign of the score function to provide good performance on test sets; a conditional probability estimator, by contrast, must be accurate at all thresholds, and therefore the absolute value of the score function is also critical, not just its sign. Hence, probability estimators face a much more difficult task.

3.3.2 Machine Learning Methods and Probability Estimation

Many of the machine learning methods discussed in Chapter 2 can be used to form conditional class probability estimates as well as classifications. We briefly review some of the known results pertaining to these.

Not surprisingly, individual CART trees tend to give fairly poor conditional class probability estimates: they assign the same conditional class probabilities to all points that fall within a given terminal node, thus ignoring any heterogeneity among them. This unrealistic property is not shared by methods that combine trees, such as boosting and Random forests, and those methods therefore hold greater hope for providing successful probability estimates.
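
The piecewise-constant behavior of a single tree can be illustrated with a small sketch. It assumes, as is common but not stated in the text, that the estimate in a terminal node is the fraction of positive training labels in that node. The node names, labels, and the helper `leaf_probabilities` are hypothetical.

```python
from collections import defaultdict


def leaf_probabilities(leaf_ids, labels):
    """Estimate P(Y=1) in each terminal node as its fraction of positive labels."""
    counts = defaultdict(lambda: [0, 0])  # node -> [n_positive, n_total]
    for leaf, y in zip(leaf_ids, labels):
        counts[leaf][0] += y
        counts[leaf][1] += 1
    return {leaf: pos / total for leaf, (pos, total) in counts.items()}


# Hypothetical training data: the terminal node of each point and its 0/1 label.
train_leaves = ["A", "A", "A", "A", "B", "B"]
train_labels = [1, 1, 1, 0, 0, 0]
leaf_prob = leaf_probabilities(train_leaves, train_labels)

# Every new point routed to node "A" receives the same estimate,
# however different the points are from one another.
new_point_leaves = ["A", "A", "B"]
estimates = [leaf_prob[leaf] for leaf in new_point_leaves]
```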

The forward, stage-wise additive view of boosting presented in Chapter 2 suggests that AdaBoost can be transformed into an estimator of the conditional class probability distribution via a link function (Friedman et al., 2000a). It also led to the development of other algorithms like LogitBoost which use the same forward, stage-wise additive optimization but for other loss functions; these other algorithms are therefore also equipped with link functions to obtain conditional class probability estimates (Friedman et al., 2000a).
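
For reference, the link function usually associated with AdaBoost under this view can be written out as below. The notation is an assumption, since this section does not fix any: labels are coded y ∈ {-1, +1}, F(x) is the score function, and p(x) = P(Y = 1 | x). The derivation follows from minimizing the population exponential loss pointwise.

```latex
% Pointwise population exponential loss at x:
%   E[ e^{-Y F(x)} | x ] = p(x) e^{-F(x)} + (1 - p(x)) e^{F(x)}
% Setting the derivative with respect to F(x) to zero gives
\[
  F(x) = \frac{1}{2} \log \frac{p(x)}{1 - p(x)}
  \quad\Longleftrightarrow\quad
  p(x) = \frac{1}{1 + e^{-2F(x)}} .
\]
```

Under this link, F(x) = 0 corresponds to p(x) = 1/2, so thresholding the score at zero is classification at the median.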

Logistic regression also uses a link function and is known to provide good probability estimates (and therefore good classifications) when the functional form is known. Since AdaBoost provides good classifications even when the functional form is unknown, it was hoped that the link function of Friedman et al. (2000a) would transform it into a good probability estimator in such cases.

Unfortunately, it has been shown by several studies that AdaBoost and LogitBoost provide poor estimates of the conditional class probability distribution (Mease et al., 2007; Mease and Wyner, 2008; McShane, 2007). Typically, when AdaBoost [...]ing probability estimates via the link function have diverged to near zero or one. Furthermore, the same is true of LogitBoost despite the fact that estimation of class probabilities via log-likelihood loss provided the motivation for this algorithm.

One of the reasons boosting is so successful at classification is that the "score function" (i.e., the weighted sum of base learners) tends to be very large in absolute value: this leads to overfit probability estimates that diverge to zero or one (and which are therefore quite poor), whereas it does not lead to overfit classifications (because, for classification, only the sign of the score function matters, not its absolute value). Since providing probability estimates requires being a good quantile classifier for all quantiles (i.e., the absolute value of the score function does matter), AdaBoost tends to fail at probability estimation.
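
The contrast between sign and magnitude can be made concrete with a stylized sketch. It assumes the link p = 1 / (1 + exp(-2F)) commonly attributed to AdaBoost, and it mimics a diverging score function simply by multiplying hypothetical scores by a constant. Real boosting runs do not scale scores uniformly, so this is an illustration only.

```python
import math


def link(score):
    """Map a score F(x) to a probability via 1 / (1 + exp(-2 F(x)))."""
    return 1.0 / (1.0 + math.exp(-2.0 * score))


scores = [-0.4, -0.1, 0.2, 0.6]        # hypothetical moderate scores
inflated = [20.0 * s for s in scores]  # same signs, much larger magnitude

# Classification at the median depends only on the sign, which is unchanged.
labels = [1 if s > 0 else 0 for s in scores]
labels_inflated = [1 if s > 0 else 0 for s in inflated]

# The probability estimates, however, are pushed toward zero or one.
probs = [link(s) for s in scores]
probs_inflated = [link(s) for s in inflated]
```

The labels agree exactly, while the inflated scores yield estimates that all lie within 0.02 of zero or one.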

The apparent failure of boosting to estimate probabilities and the theoretical view of it as a forward, stagewise additive model have led to a number of refinements of the algorithm. One such refinement is LogitBoost (Friedman et al., 2000a), but there are also suggestions for early stopping (Dettling and Buhlmann, 2003), shrinkage (Friedman et al., 2000b), regularization methods (Bickel et al., 2006; Jiang, 2004; Lugosi and Vayatis, 2004), and using shallower trees / weaker base learners (Friedman et al., 2000a; Hastie et al., 2001). But, given that boosting overfits probability estimates but not median estimates, it is questionable whether boosting's success is due to similarity with logistic regression as suggested in Friedman [...] be misguided (Mease and Wyner, 2008).

If one wants to retain the forward, stagewise additive logistic regression viewpoint, it thus seems one must temper it by noting that overfit probability estimates may be required to attain optimal classifications. Furthermore, in the presence of unequal misclassification costs (or imbalanced base rates, or classification at quantiles different from 1/2), this view may lead to poor performance: one must hope to stop the boosting algorithm early enough that the score function has not diverged (which would produce bad probability estimates) but late enough that the algorithm has sufficiently learned the data structure (and therefore produces good class estimates).
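
A stylized sketch of why this matters at quantiles other than 1/2 follows. It again assumes the link p = 1 / (1 + exp(-2F)) and stands in for "more boosting rounds" by scaling hypothetical scores by a constant, which is a simplification rather than the behavior of any particular run.

```python
import math


def link(score):
    """Map a score F(x) to a probability via 1 / (1 + exp(-2 F(x)))."""
    return 1.0 / (1.0 + math.exp(-2.0 * score))


def classify_at_quantile(scores, q):
    """Assign class 1 where the linked probability estimate exceeds q."""
    return [1 if link(s) > q else 0 for s in scores]


early = [-0.9, -0.2, 0.3, 0.8, 1.6]  # hypothetical scores before divergence
late = [15.0 * s for s in early]     # same ordering and signs, diverged magnitude

early_median = classify_at_quantile(early, 0.5)
early_q09 = classify_at_quantile(early, 0.9)
late_median = classify_at_quantile(late, 0.5)
late_q09 = classify_at_quantile(late, 0.9)
```

With the early scores, the classifier at q = 0.9 is stricter than the median classifier. Once the scores have diverged, every positive score maps to a probability above 0.9, so the q = 0.9 classifier collapses onto the median classifier.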

A final point is that bagging, and in particular Random forests, tends to produce much more reasonable and sometimes even quite good probability estimates, as is shown by Bostrom (2007) and Bostrom (2008) (particularly when calibrated) and by our own results presented in Chapter 6. It is thought that, since these techniques do not recursively re-weight the individual data points but instead rely on the bootstrap, they avoid the overfitting tendency of boosting methods. Much exploration is still needed, however.
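
One common way for an ensemble of bootstrapped trees to produce a probability estimate is the fraction of trees voting for class 1. The sketch below assumes that scheme, which is not necessarily the one used in the cited studies or in Chapter 6. The vote matrix and the helper `vote_fraction` are hypothetical.

```python
def vote_fraction(tree_votes):
    """Return, for each point, the fraction of trees that vote for class 1."""
    n_trees = len(tree_votes)
    n_points = len(tree_votes[0])
    return [
        sum(votes[i] for votes in tree_votes) / n_trees
        for i in range(n_points)
    ]


# Hypothetical 0/1 votes: one row per bootstrapped tree, one column per point.
tree_votes = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
]
p_hat = vote_fraction(tree_votes)
```

Because each tree sees a different bootstrap sample, the votes can disagree, and the resulting estimates take intermediate values rather than being forced toward zero or one.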
