Before discussing the general problem of estimating the entire conditional class
probability distribution, we first narrow our focus to estimating the boundary or
region for a particular quantile. That is, we are interested in identifying the set
of x such that P(Y = 1|X = x) > q, where q is the quantile of interest. We will
show that this subproblem is important in its own right and that methods exist
which solve it reasonably well.
3.2.1 Unequal Costs
In the standard binary classification task (i.e., k = 2), classifiers are usually judged
by misclassification error. Minimizing misclassification error is equivalent to
minimizing the loss function which gives equal costs to false positives and false
negatives; it is also equivalent to classifying at the 1/2 quantile of the conditional
class probability function P(Y = 1|x). When the costs are in fact unequal,
misclassification error, which assumes equal costs, is an inappropriate loss function.
For instance, in the classic courtroom setting, sending an innocent man to prison is
considered worse than failing to convict a guilty man; likewise, in many medical
applications, false negatives are more serious than false positives.
Without loss of generality, we can assume that the costs of a false positive and a
false negative sum to one and equal c and 1 − c respectively. If p(x) = P(Y = 1|x)
and 1 − p(x) = P(Y = 0|x) are the conditional probabilities of a positive and a
negative respectively, then the risk, or expected loss, of classifying an observation
as a positive is (1 − p(x))c and the risk of classifying it as a negative is
p(x)(1 − c). To minimize risk, we classify as a positive when
(1 − p(x))c < p(x)(1 − c), which is equivalent to c < p(x). This shows that binary
classification with unequal costs is equivalent to quantile estimation, that is,
estimating the region p(x) > q = c. Most classifiers, which implicitly assume equal
costs, are therefore median classifiers since they estimate the region p(x) > q = 1/2.
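The decision rule derived above can be checked numerically. The following is a minimal sketch, not part of the original text: the function names, the cost c = 0.8, and the probability values are all hypothetical and chosen only for illustration.

```python
import numpy as np

def expected_cost(p, c, predict_positive):
    """Expected loss at a point with P(Y = 1|x) = p.

    c is the cost of a false positive and 1 - c the cost of a false negative.
    """
    return (1.0 - p) * c if predict_positive else p * (1.0 - c)

def classify_at_quantile(p, q):
    """Label a point positive (1) wherever p(x) > q, negative (0) otherwise."""
    return (np.asarray(p) > q).astype(int)

# Hypothetical costs: a false positive is four times as costly as a false negative.
c = 0.8
p = np.array([0.10, 0.50, 0.79, 0.81, 0.95])

# Thresholding p(x) at q = c ...
labels = classify_at_quantile(p, q=c)

# ... agrees with choosing, point by point, the label with the smaller expected cost.
min_risk_labels = np.array(
    [int(expected_cost(pi, c, True) < expected_cost(pi, c, False)) for pi in p]
)
```

Under these assumed values only the last two points are labeled positive, since a high false-positive cost moves the threshold from 1/2 up to 0.8.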
A final point is that, besides arising from unequal misclassification costs, quantile
classification can also be formulated as an end in itself. For instance, in an internet
marketing campaign, one may only want to serve an ad on a particular website if
the probability of a user clicking the ad is greater than some threshold probability.
3.2.2 Imbalanced Base Rates
The problem of imbalanced base rates occurs when one applies a classifier trained
on one dataset with one set of base rate probabilities to a dataset with a different set
of base rate probabilities. For example, one might train a classifier on a population
with 20% positives but apply it to a population with 50% positives. Below, we
show that a change in the base rate is equivalent to a change in the quantile at
which the conditional probability is thresholded. Hence, imbalanced base rates,
quantile classification, and classifying with unequal costs of false positives and
negatives are equivalent to one another (Elkan, 2001; Mease et al., 2007). Let
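\[
p(x) = P(Y = 1 \mid X = x), \quad \pi = P(Y = 1), \quad
f_1(x) = P(X = x \mid Y = 1), \quad f_0(x) = P(X = x \mid Y = 0).
\]
By Bayes' theorem,
\[
p(x) = \frac{f_1(x)\,\pi}{f_1(x)\,\pi + f_0(x)\,(1-\pi)}.
\]
Equivalently,
\[
\frac{p(x)}{1-p(x)} = \frac{f_1(x)\,\pi}{f_0(x)\,(1-\pi)}
\]
or
\[
\frac{p(x)}{1-p(x)} \Big/ \frac{\pi}{1-\pi} = \frac{f_1(x)}{f_0(x)}. \tag{3.2.1}
\]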
Now, assume there is another population which is the same in all respects except
that the base rates π and 1 − π are different. Assume that in this new population
the base rates are π∗ and 1 − π∗. If we let p∗(x) = P(Y = 1|X = x) be the
conditional probability that Y = 1 in this new population, Equation 3.2.1 implies
that p(x) and p∗(x) can be related as follows:
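\[
\frac{p(x)}{1-p(x)} \Big/ \frac{\pi}{1-\pi} = \frac{f_1(x)}{f_0(x)}
= \frac{p^*(x)}{1-p^*(x)} \Big/ \frac{\pi^*}{1-\pi^*}.
\]
Hence,
\[
\frac{p^*(x)}{1-p^*(x)} = \frac{p(x)}{1-p(x)} \cdot \frac{1-\pi}{\pi} \cdot \frac{\pi^*}{1-\pi^*}. \tag{3.2.2}
\]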
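As an illustration, Equation 3.2.2 can be applied numerically. The sketch below is not from the original text: the function names are hypothetical, and it assumes that all probabilities and base rates lie strictly between 0 and 1. The second function inverts the same odds relation to find the quantile of p(x) that corresponds to a given threshold on p∗(x).

```python
def adjust_for_base_rate(p, pi_old, pi_new):
    """Map p(x) under base rate pi_old to p*(x) under base rate pi_new (Equation 3.2.2)."""
    odds_new = (p / (1.0 - p)) * ((1.0 - pi_old) / pi_old) * (pi_new / (1.0 - pi_new))
    return odds_new / (1.0 + odds_new)

def equivalent_quantile(q_new, pi_old, pi_new):
    """Quantile q for p(x) such that p(x) > q exactly when p*(x) > q_new."""
    odds_old = (q_new / (1.0 - q_new)) * (pi_old / (1.0 - pi_old)) * ((1.0 - pi_new) / pi_new)
    return odds_old / (1.0 + odds_old)

# Base rates from the example in the text: trained with 20% positives,
# applied to a population with 50% positives.
p_star = adjust_for_base_rate(0.2, pi_old=0.2, pi_new=0.5)
q_old = equivalent_quantile(0.5, pi_old=0.2, pi_new=0.5)
```

With these base rates, a point with p(x) = 0.2 has p∗(x) = 0.5, so classifying at the 1/2 quantile in the new population corresponds to classifying at the 0.2 quantile of the original p(x).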
Thus, we can obtain a classifier on the new population by adjusting the old one
for the new base rates. This has a profound implication: while it is obvious that
an algorithm that produces good probability estimates will also produce good class
estimates, Equation 3.2.2 suggests an algorithm that produces good class estimates
will also produce good probability estimates if the base rate distribution is "tilted"
in the proper way (in fact, this is the motivation for the Jittered
Over/Under-Sampling-Boost (JOUS-Boost) technique of Mease et al. (2007)). That is,
there may be an isomorphism of sorts between the space of good classifiers and the space of good
3.2.3 Machine Learning Approaches to Quantile Estimation
Binary classification with unequal costs or for populations with imbalanced base
rates is common in the literature. A classic example of the latter is bankruptcy
prediction where there are vast numbers of negatives but very few positives (Foster
and Stine, 2004); one must correct for this discrepancy in order to make accurate
predictions.
Algorithms such as AdaBoost (Freund and Schapire, 1996), which have proven
exceptional at classification at the 1/2 quantile, have been modified to classify
with unequal costs. Such modifications include Slipper (Cohen and Singer, 1999),
AdaCost (Fan et al., 1999), CSB1 and CSB2 (Ting, 2000), and RareBoost (Joshi
et al., 2001). All have shown some improvement over AdaBoost, but no method
appears to dominate.
Another approach to the triply equivalent problem of unequal costs, quantile
thresholding, and imbalanced base rates, popular in the computer science literature,
involves under-sampling and over-sampling (Chan and Stolfo, 1998; Elkan, 2001;
Estabrooks et al., 2004). One typically over-samples the rare class with replacement
and/or under-samples the dominant class without replacement. Sampling with
replacement carries with it the concomitant issue of ties in the sample (i.e.,
repeated datapoints). Tie-breaking is necessary for certain algorithms
which are driven more by the set of unique datapoints than by the number of tied
ones. One remedy is the Synthetic Minority Over-Sampling TEchnique (SMOTE) of
Chawla et al. (2002, 2003), which avoids ties in over-sampled classes by moving the
sampled covariates towards neighbors of the same class.
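The two sampling ideas described above can be sketched with NumPy. This is a simplified illustration under stated assumptions, not the reference implementation of any cited method: the function names, the toy data, and the parameters n_pos, n_neg, n_new, and k are hypothetical. The first function assumes class 1 is the rare class and class 0 the dominant one; the second shows a SMOTE-style interpolation toward a randomly chosen near neighbor of the same class and requires k to be smaller than the number of rare-class points.

```python
import numpy as np

def resample_classes(X, y, n_pos, n_neg, rng):
    """Over-sample class 1 with replacement and under-sample class 0 without replacement."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    pos_idx = rng.choice(pos, size=n_pos, replace=True)   # may contain ties
    neg_idx = rng.choice(neg, size=n_neg, replace=False)  # requires n_neg <= len(neg)
    idx = np.concatenate([pos_idx, neg_idx])
    return X[idx], y[idx]

def interpolate_toward_neighbors(X_min, n_new, k, rng):
    """Create n_new synthetic points, each lying between a rare-class point
    and one of its k nearest rare-class neighbors (SMOTE-style sketch)."""
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise squared Euclidean distances within the rare class.
    d = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]    # indices of the k nearest neighbors
    base = rng.integers(0, len(X_min), size=n_new)
    nb = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                # interpolation fraction in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Toy data: 10 positives and 90 negatives in two dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([1] * 10 + [0] * 90)

X_bal, y_bal = resample_classes(X, y, n_pos=50, n_neg=50, rng=rng)
X_syn = interpolate_toward_neighbors(X[y == 1], n_new=40, k=3, rng=rng)
```

Because the over-sampled positives in X_bal are drawn with replacement from only 10 points, they necessarily contain repeats, whereas the interpolated points in X_syn are almost surely distinct.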