Quantile Estimation - Machine Learning Methods with Time Series Dependence

Before discussing the general problem of estimating the entire conditional class probability distribution, we first narrow our focus to estimating the boundary, or region, for a particular quantile. That is, we are interested in identifying the set of x such that $P(Y = 1 \mid X = x) > q$, where $q$ is the quantile of interest. We will show that this subproblem is important in its own right and that methods exist which solve it reasonably well.

3.2.1 Unequal Costs

In the standard binary classification task (i.e., k = 2), classifiers are usually judged by misclassification error. Minimizing misclassification error is equivalent to minimizing the loss function which gives equal costs to false positives and false negatives; it is also equivalent to classifying at the 1/2 quantile of the conditional class probability function $P(Y = 1 \mid x)$.

When the two kinds of error carry different costs, misclassification error, because it assumes equal costs, is an inappropriate loss function. For instance, in the

classic courtroom setting, sending an innocent man to prison is considered worse

than failing to convict a guilty man; likewise, in many medical applications, false

negatives are more serious than false positives.

Without loss of generality, we can assume that the cost of a false positive and the cost of a false negative sum to one, so that they equal $c$ and $1 - c$ respectively. If $p(x) = P(Y = 1 \mid x)$ and $1 - p(x) = P(Y = 0 \mid x)$ are the conditional probabilities of a positive and a negative respectively, then the risk, or expected loss, of classifying as a positive is $(1 - p(x))c$ and the risk of classifying as a negative is $p(x)(1 - c)$. In order to minimize risk, we classify as a positive when $(1 - p(x))c < p(x)(1 - c)$, which is equivalent to $c < p(x)$. This shows that binary classification with unequal costs is equivalent to quantile estimation, that is, to estimating the region $p(x) > q = c$. Most classifiers, which implicitly assume equal costs, are therefore median classifiers, since they estimate the region $p(x) > q = 1/2$.
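The equivalence between minimizing risk and thresholding at the cost c can be checked numerically. The following is a minimal sketch, assuming the conditional probabilities p(x) are already known; the function names and the example values are illustrative and do not come from the source.

```python
import numpy as np

def expected_costs(p, c):
    """Return (risk of predicting positive, risk of predicting negative).

    p: conditional probability P(Y = 1 | x), assumed known here.
    c: cost of a false positive; a false negative costs 1 - c.
    """
    p = np.asarray(p, dtype=float)
    risk_pos = (1.0 - p) * c    # a positive call is wrong only when Y = 0
    risk_neg = p * (1.0 - c)    # a negative call is wrong only when Y = 1
    return risk_pos, risk_neg

def quantile_classify(p, c):
    """Classify as positive exactly when p(x) > c."""
    return (np.asarray(p, dtype=float) > c).astype(int)

# Illustrative values only (hypothetical, not taken from the text).
p = np.array([0.05, 0.2, 0.3, 0.6, 0.9])
c = 0.25

risk_pos, risk_neg = expected_costs(p, c)
by_risk = (risk_pos < risk_neg).astype(int)   # pick the lower-risk label
by_threshold = quantile_classify(p, c)        # threshold at the quantile q = c

print(by_risk)        # [0 0 1 1 1]
print(by_threshold)   # [0 0 1 1 1]
```

Setting c = 0.5 in this sketch recovers the usual median classifier.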

A final point is that, besides arising from unequal misclassification costs, quantile

classification can also be formulated as an end in itself. For instance, in an internet

marketing campaign, one may only want to serve an ad on a particular website if

the probability of a user clicking the ad is greater than some threshold probability.

3.2.2 Imbalanced Base Rates

The problem of imbalanced base rates occurs when one applies a classifier trained

on one dataset with one set of base rate probabilities to a dataset with a different set

of base rate probabilities. For example, one might train a classifier on a population

with 20% positives but apply it to a population with 50% positives. Below, we

show that a change in the base rate is equivalent to changing the quantile at which

to threshold the calculations. Hence, imbalanced base rates, quantile classification,

and classifying with unequal costs of false positives and negatives are equivalent to

one another (Elkan, 2001; Mease et al., 2007). Let

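\[
p(x) = P(Y = 1 \mid X = x), \quad \pi = P(Y = 1), \quad f_1(x) = P(X = x \mid Y = 1), \quad f_0(x) = P(X = x \mid Y = 0).
\]

By Bayes' theorem,

\[
p(x) = \frac{f_1(x)\,\pi}{f_1(x)\,\pi + f_0(x)\,(1 - \pi)}.
\]

Equivalently,

\[
\frac{p(x)}{1 - p(x)} = \frac{f_1(x)\,\pi}{f_0(x)\,(1 - \pi)}
\]

or

\[
\frac{p(x)}{1 - p(x)} \Big/ \frac{\pi}{1 - \pi} = \frac{f_1(x)}{f_0(x)}. \tag{3.2.1}
\]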

Now, assume there is another population which is the same in all respects except that the base rates $\pi$ and $1 - \pi$ are different. Assume that in this new population the base rates are $\pi^*$ and $1 - \pi^*$. If we let $p^*(x)$ denote the conditional probability that $Y = 1$ given $X = x$ in this new population, Equation 3.2.1 implies that $p(x)$ and $p^*(x)$ are related as follows:

\[
\frac{p(x)}{1 - p(x)} \Big/ \frac{\pi}{1 - \pi} = \frac{f_1(x)}{f_0(x)} = \frac{p^*(x)}{1 - p^*(x)} \Big/ \frac{\pi^*}{1 - \pi^*}.
\]

Hence,

\[
\frac{p^*(x)}{1 - p^*(x)} = \frac{p(x)}{1 - p(x)} \cdot \frac{1 - \pi}{\pi} \cdot \frac{\pi^*}{1 - \pi^*}. \tag{3.2.2}
\]

Thus, we can obtain a classifier on the new population by adjusting the old one for the new base rates. This has a profound implication: while it is obvious that an algorithm that produces good probability estimates will also produce good class estimates, Equation 3.2.2 suggests that an algorithm that produces good class estimates will also produce good probability estimates if the base rate distribution is "tilted" in the proper way (in fact, this is the motivation for the Jittered Over/Under-Sampling-Boost (JOUS-Boost) technique of Mease et al. (2007)). That is, there may be an isomorphism of sorts between the space of good classifiers and the space of good probability estimators.
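The base rate adjustment in Equation 3.2.2 can be sketched in a few lines. This is an illustrative sketch only, assuming probabilities strictly between 0 and 1; the function names are hypothetical, and it is not an implementation of JOUS-Boost. It also computes the quantile on the old population that corresponds to a given threshold on the new one, using the 20% and 50% base rates from the example earlier in this section.

```python
import numpy as np

def adjust_for_base_rate(p, pi_old, pi_new):
    """Map p(x) under base rate pi_old to p*(x) under pi_new (Equation 3.2.2).

    Assumes 0 < p < 1 and 0 < pi_old, pi_new < 1 (an assumption of this sketch).
    """
    p = np.asarray(p, dtype=float)
    odds_old = p / (1.0 - p)
    odds_new = odds_old * ((1.0 - pi_old) / pi_old) * (pi_new / (1.0 - pi_new))
    return odds_new / (1.0 + odds_new)   # convert odds back to a probability

def equivalent_quantile(q_new, pi_old, pi_new):
    """Threshold on p(x) that matches thresholding p*(x) at q_new.

    The odds map is monotone, so it is inverted by applying the same
    formula with the roles of the two populations swapped.
    """
    return adjust_for_base_rate(q_new, pi_new, pi_old)

# Old population has 20% positives, new population has 50% positives.
p_old = np.array([0.1, 0.2, 0.5])
p_new = adjust_for_base_rate(p_old, pi_old=0.2, pi_new=0.5)
q_old = equivalent_quantile(0.5, pi_old=0.2, pi_new=0.5)

print(np.round(p_new, 4))    # [0.3077 0.5    0.8   ]
print(round(float(q_old), 4))  # 0.2
```

In this example, classifying at the median on the new population corresponds to classifying at the 0.2 quantile on the old one, which illustrates the claimed equivalence between a change in base rate and a change in quantile.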

3.2.3 Machine Learning Approaches to Quantile Estimation

Binary classification with unequal costs or for populations with imbalanced base

rates is common in the literature. A classic example of the latter is bankruptcy

prediction where there are vast numbers of negatives but very few positives (Foster

and Stine, 2004); one must correct for this discrepancy in order to make accurate

predictions.

Algorithms such as AdaBoost (Freund and Schapire, 1996), which have proven

exceptional at classification at the 1/2 quantile, have been modified to classify

with unequal costs. Such modifications include Slipper (Cohen and Singer, 1999),

AdaCost (Fan et al., 1999), CSB1 and CSB2 (Ting, 2000), and RareBoost (Joshi

et al., 2001). All have shown some improvement over AdaBoost, but no method

appears to dominate.

Another approach to the triply equivalent problem of unequal costs / quantile thresholding / imbalanced base rates that is popular in the computer science literature involves under-sampling and over-sampling (Chan and Stolfo, 1998; Elkan, 2001; Estabrooks et al., 2004). One typically over-samples the rare class with replacement and/or under-samples the dominant class without replacement. Sampling with replacement carries with it the concomitant issue of ties in the sample (i.e., repeated datapoints). Tie-breaking is necessary for certain algorithms which are driven more by the set of unique datapoints than by the number of tied ones. One such approach is the Synthetic Minority Over-Sampling TEchnique (SMOTE) of Chawla et al. (2002, 2003), which avoids ties in over-sampled classes by moving the sampled covariates towards neighbors of the same class.
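The two sampling ideas described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the published SMOTE algorithm or any library's implementation: the function names, the choice of k, and the synthetic data are all hypothetical. The first function over-samples positives with replacement and under-samples negatives without replacement; the second avoids ties by interpolating each sampled positive toward one of its k nearest same-class neighbors.

```python
import numpy as np

def resample_classes(X, y, n_pos, n_neg, rng):
    """Draw n_pos positives with replacement and n_neg negatives without."""
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # Rare class: sampling with replacement, so repeated rows (ties) can occur.
    pos_draw = rng.choice(pos_idx, size=n_pos, replace=True)
    # Dominant class: without replacement, so n_neg must not exceed its size.
    neg_draw = rng.choice(neg_idx, size=n_neg, replace=False)
    idx = np.concatenate([pos_draw, neg_draw])
    return X[idx], y[idx]

def interpolate_toward_neighbors(X_pos, n_new, k, rng):
    """Create n_new synthetic positives between a point and a near neighbor.

    Requires k < len(X_pos). Brute-force distances; fine for small samples.
    """
    diffs = X_pos[:, None, :] - X_pos[None, :, :]
    dist2 = (diffs ** 2).sum(axis=2)
    np.fill_diagonal(dist2, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dist2, axis=1)[:, :k]  # k nearest same-class points
    base = rng.integers(0, len(X_pos), size=n_new)
    partner = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                  # fraction of the way to move
    return X_pos[base] + gap * (X_pos[partner] - X_pos[base])

# Hypothetical data: 100 points, 10% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.zeros(100, dtype=int)
y[:10] = 1

X_bal, y_bal = resample_classes(X, y, n_pos=50, n_neg=50, rng=rng)
X_syn = interpolate_toward_neighbors(X[y == 1], n_new=40, k=3, rng=rng)
```

Here `X_bal` has a 50% base rate but necessarily contains repeated positives, since 50 draws are taken from only 10 points, whereas the rows of `X_syn` are new points lying on segments between existing positives.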
