Chapter 3 Model Construction for Phantom Examples
3.2 Analysis
3.2.1 Domain Knowledge Bias
Consider a binary classification task in which, given an input $x \in \mathcal{X}$, we try to predict the label $y \in \{-1, 1\}$, assuming that there exists an unknown underlying joint distribution $D$ over $(x, y) \in \mathcal{X} \times \mathcal{Y}$. When training a discriminative learner, we present the learner with labeled examples from the two classes, sampled from their respective true underlying distributions $p_0(X) = \Pr_D(X \mid Y = -1)$ and $p_1(X) = \Pr_D(X \mid Y = 1)$.
We can think of communicating the distributions p0(X) and p1(X) to the learner through the
training examples. However, for a finite training set, the empirical distribution may be inadequate for the learner to come up with a good decision boundary.
We augment the training set by generating phantom examples drawn from class-specific generative models with parameters $\theta = (\theta_0, \theta_1)$, where $\theta_0 \in \Theta_0$ describes the model for class $-1$ and $\theta_1 \in \Theta_1$ the model for class $1$. The corresponding distributions are denoted $q_0^{\theta_0}(X) = \Pr_{\theta_0}(X \mid Y = -1)$ and $q_1^{\theta_1}(X) = \Pr_{\theta_1}(X \mid Y = 1)$ respectively. Both $\theta_0$ and $\theta_1$ are obtained by fitting the parameters to the real training set.
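As a minimal sketch of this augmentation step, the following assumes one-dimensional inputs and a Gaussian model per class; both choices, the toy data, and the function names `fit_gaussian` and `draw_phantoms` are illustrative assumptions, not the models used in this work.

```python
import random
import statistics

def fit_gaussian(samples):
    """Fit theta = (mean, std) to one class's real examples by maximum likelihood."""
    return statistics.fmean(samples), statistics.pstdev(samples)

def draw_phantoms(theta, m, rng):
    """Draw m phantom examples from the fitted class-specific model."""
    mean, std = theta
    return [rng.gauss(mean, std) for _ in range(m)]

rng = random.Random(0)
# Hypothetical real training set: a few 1-D examples per class label.
real = {-1: [0.1, -0.4, 0.3, -0.2], 1: [1.9, 2.3, 2.1, 1.6]}

# theta[-1] plays the role of theta_0, theta[1] the role of theta_1.
theta = {label: fit_gaussian(xs) for label, xs in real.items()}
phantoms = {label: draw_phantoms(theta[label], 100, rng) for label in real}

# Augmented training set: real examples plus labelled phantom examples.
augmented = [(x, y) for y in real for x in real[y] + phantoms[y]]
print(len(augmented))  # 2 * (4 + 100) = 208
```

The augmented list is what would be handed to the discriminative learner in place of the real examples alone.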
For complex, real-world problems, it is extremely unlikely that our space of generative models contains the true distributions $p_0$ and $p_1$. However, we expect that $q_0$ and $q_1$ will be closer to the true distributions when tuned with more real examples. Using the KL-divergence as a measure of closeness, we define the bias of the trained generative models as follows:
$$\mathrm{Bias}(\theta_0, \theta_1) = \mathrm{KL}(p_0 \,\|\, q_0^{\theta_0}) + \mathrm{KL}(p_1 \,\|\, q_1^{\theta_1})$$
where KL denotes the KL-divergence.
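For discrete distributions the bias can be computed directly. The sketch below assumes finite outcome spaces represented as probability lists and natural logarithms; the example distributions are made up for illustration.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for discrete distributions given as equal-length lists.

    Assumes q > 0 wherever p > 0; terms with p = 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def bias(p0, q0, p1, q1):
    """Sum of the two class-wise divergences from the true to the fitted model."""
    return kl(p0, q0) + kl(p1, q1)

# Hypothetical true distributions (p) and fitted generative models (q).
p0, p1 = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]
q0, q1 = [0.6, 0.3, 0.1], [0.2, 0.3, 0.5]

print(bias(p0, q0, p1, q1))   # strictly positive: the models are imperfect
print(bias(p0, p0, p1, p1))   # 0.0: the bias vanishes when q equals p
```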
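Since the KL-divergence is nonnegative, there exists a minimal bias
$$\epsilon^*_\theta = \inf_{\theta_0, \theta_1} \mathrm{Bias}(\theta_0, \theta_1).$$
With more real examples, the resulting models $\theta_0$ and $\theta_1$ will have bias $\epsilon_{\theta_0,\theta_1}$ closer to $\epsilon^*_\theta$.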
The bias provides an upper bound to the deviation from the optimal Bayes error rate when distributions q0 and q1 are used to classify test examples which are drawn according to p0 and
p1. This is given by:
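Proposition 1. Assume that the class prior is uniform (i.e. $\Pr(Y = -1) = \Pr(Y = 1) = 1/2$). Let $\epsilon^*$ be the optimal Bayes error for examples drawn according to $p_0$ and $p_1$. The expected classification error when $q_0^{\theta_0}$ and $q_1^{\theta_1}$ are used to make decisions (instead of $p_0$ and $p_1$), denoted $\epsilon_{\theta_0,\theta_1}$, is upper-bounded by:
$$\epsilon_{\theta_0,\theta_1} \le \epsilon^* + \sqrt{\mathrm{Bias}(\theta_0, \theta_1)} \qquad (3.1)$$
Proof.
$$\epsilon_{\theta_0,\theta_1} - \epsilon^* \le \sum_X \left( \tfrac{1}{2}\,|p_0(X) - q_0(X)| + \tfrac{1}{2}\,|p_1(X) - q_1(X)| \right) \le \tfrac{1}{2}\left( \sqrt{2\,\mathrm{KL}(p_0 \,\|\, q_0)} + \sqrt{2\,\mathrm{KL}(p_1 \,\|\, q_1)} \right) \le \sqrt{\mathrm{KL}(p_0 \,\|\, q_0) + \mathrm{KL}(p_1 \,\|\, q_1)}$$
where the first inequality is due to [Devroye et al., 1996], the second employs Pinsker's inequality, and the third follows from the concavity of the square root, $\tfrac{1}{2}(\sqrt{a} + \sqrt{b}) \le \sqrt{(a + b)/2}$, applied with $a = 2\,\mathrm{KL}(p_0 \,\|\, q_0)$ and $b = 2\,\mathrm{KL}(p_1 \,\|\, q_1)$.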
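The chain of inequalities in the proof can be checked numerically on a small discrete example. The sketch below assumes uniform class priors, natural-log KL-divergence, and made-up three-outcome distributions; it evaluates the excess error, the L1 term, the Pinsker term, and the square root of the summed divergences that bounds the Pinsker term by concavity.

```python
import math

def kl(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def error_rate(p0, p1, d0, d1):
    """Error under the true (p0, p1) with uniform priors when class 1 is
    predicted exactly where d1 > d0, and class -1 elsewhere."""
    return sum(0.5 * (a if v > u else b) for a, b, u, v in zip(p0, p1, d0, d1))

# Hypothetical true distributions and fitted models over three outcomes.
p0, p1 = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]
q0, q1 = [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]

bayes = error_rate(p0, p1, p0, p1)      # optimal Bayes error
plug_in = error_rate(p0, p1, q0, q1)    # error when deciding with q0 and q1
excess = plug_in - bayes

l1_term = (0.5 * sum(abs(a - b) for a, b in zip(p0, q0))
           + 0.5 * sum(abs(a - b) for a, b in zip(p1, q1)))
pinsker_term = 0.5 * (math.sqrt(2 * kl(p0, q0)) + math.sqrt(2 * kl(p1, q1)))
sqrt_bias = math.sqrt(kl(p0, q0) + kl(p1, q1))

# Each quantity should be no larger than the one after it.
print(excess, l1_term, pinsker_term, sqrt_bias)
```

A single example does not prove the bound, but it is a quick sanity check on the direction and constants of each step.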
Proposition 1 tells us that as long as the generative models are close enough to the true models, the achievable error rate will be close to the true optimum. Depending on the problem, it may or may not attain the true optimum; this largely depends on the amount of information our generative models can provide about the decision boundary of the problem.
Assume that the space of classifiers learnable by our discriminative learner includes those with performance arbitrarily close to the optimal Bayes error. Let $\epsilon_{N,M,\tilde N}$ be the expected error of the resulting classifier when $N$ real examples and $M$ phantom examples are used to train the discriminative learner, where the phantom examples are generated using a model tuned with $\tilde N$ real examples ($\tilde N$ and $N$ will usually be the same, i.e. all real examples participate in tuning the generative models as well as in training the final classifier). Letting $\epsilon^*$ be the best possible error rate (the Bayes optimal error) for the particular problem distribution $D$, we expect that $\epsilon_{N,M,\tilde N} \to \epsilon^*$ as the number [...] learner is used.
The more interesting question concerns the expected performance when we increase the number of phantom examples used in training the discriminative learner. This largely depends on the quality of the generative models used, which in turn depends on the quality of the prior domain knowledge. Regardless of this, we expect that $\epsilon_{N,M,\tilde N} \to \epsilon^*_\theta$ when both $\tilde N, M \to \infty$ (holding $N$ fixed).
As the phantom set grows, we can expect the amount of information that it extracts from the generative models to reach an asymptote (the corresponding error rate would be $\epsilon^*_\theta$). If the generative models are helpful (good domain knowledge), the effect will be positive; if the generative models are harmful (bad domain knowledge), the effect will be negative.
Another advantage of using phantom examples, in addition to the improved classification accuracy, is robustness to classification noise. With the domain knowledge, the classifier is learned not only on the noisy distribution given by the training examples, but also on a distribution biased toward one justified by the domain theory (e.g. the generative models for the hidden strokes), making the classifier less likely to fit the noise. When tested on the (noise-free) test data, the classifier may be more likely to reveal the true labeling of the examples rather than the noisy labeling.
3.2.2 Model Space
Since we tune our generative models using only a finite number of real examples, they are susceptible to overfitting. This in turn affects the quality of the phantom examples generated. We would like our system to adaptively select a model that achieves a balance between good fit and model complexity.
Instead of using an a priori fixed generative model for each class of objects, we allow a (possibly infinite) space $\mathcal{M}$ of different generative models, each of which may have a different complexity and a different parameterization. As mentioned earlier, learning in the space of models can be greatly facilitated by having a structure over the model space $\mathcal{M}$. The structure should ideally allow a local search within model classes, where models of similar complexity are close to each other. For this work, we assume that each object class can be decomposed into conceptual parts. The model space is then constructed based on the models for individual parts.
We first define the variables involved in any particular model. We use $\mathcal{X}$ to denote the input space (for example, images) and $X$ the corresponding random variable that represents an input instance, where $X$ takes values from $\mathcal{X}$. It is often the case that $\mathcal{X}$ is a high-dimensional space where each individual component contains very little information about the class label (e.g. pixels in an image). Modeling a distribution over $\mathcal{X}$ is therefore undesirable. With prior domain knowledge, however, we can usually identify a higher-level, low-dimensional space of “hidden” variables (such as properties of strokes in a character), denoted here as $Z$, which takes values from $\mathcal{Z}$. We assume that there is a function $\phi : \mathcal{Z} \mapsto \mathcal{X}$ that converts each instance of $Z$ into a corresponding input-space instance $X = \phi(Z)$.
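To make the role of the mapping from hidden variables to inputs concrete, here is a toy sketch in which the hidden variables are the two endpoints of a straight stroke and the input space is a small binary image. The endpoint parameterization, the 8 by 8 grid, and the sampling-based rasterizer are all assumptions made for illustration; they are not the renderer used in this work.

```python
def phi(z, size=8, steps=32):
    """Render a line stroke z = (x0, y0, x1, y1), with coordinates in [0, 1],
    into a size-by-size binary image (a list of rows)."""
    x0, y0, x1, y1 = z
    image = [[0] * size for _ in range(size)]
    for s in range(steps + 1):
        t = s / steps
        # Sample a point along the segment and switch on the pixel containing it.
        col = min(size - 1, int((x0 + t * (x1 - x0)) * size))
        row = min(size - 1, int((y0 + t * (y1 - y0)) * size))
        image[row][col] = 1
    return image

z = (0.1, 0.1, 0.9, 0.9)   # 4 hidden variables describe the whole stroke
x = phi(z)                 # 64 pixel values, most of them uninformative

for row in x:
    print("".join("#" if v else "." for v in row))
```

Four numbers in the hidden space determine all 64 pixels, which is why a distribution is easier to model over the hidden variables than over the pixels.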
The variable $Z$ can be further decomposed as:
$$Z = (Z_{\Gamma_1}, Z_{\Gamma_2}, \ldots, Z_{\Gamma_k})$$
where each $\Gamma_i$ ($i = 1, \ldots, k$) denotes a model for part $i$ of the object class/category. Each model $\Gamma_i$ is represented by a $d(\Gamma_i)$-dimensional random vector, which we denote as
$$Z_{\Gamma_i} = (Z^1_{\Gamma_i}, Z^2_{\Gamma_i}, \ldots, Z^{d(\Gamma_i)}_{\Gamma_i}), \quad i = 1, \ldots, k.$$
Therefore, $Z$ is a $\tilde d$-dimensional random vector where $\tilde d = \sum_{i=1}^{k} d(\Gamma_i)$.
The model $\Gamma_i$ for each object part $i$ can be chosen from a set of possible models $\Gamma^*$. For example, when modeling strokes in a handwritten character, each stroke can be modeled either as a straight line segment or as a curve segment. In this case, let $\Gamma^* = \{\Gamma_{\mathrm{line}}, \Gamma_{\mathrm{curve}}\}$. Then each $\Gamma_i$ can be either $\Gamma_{\mathrm{line}}$ or $\Gamma_{\mathrm{curve}}$.
The complete generative model for a particular object class $C$ that consists of $k$ object parts can be fully specified by selecting a model for each part. We use a $k$-tuple $\Gamma_C = \langle \Gamma_1, \ldots, \Gamma_k \rangle$ to represent this selection. Again using the stroke models above, a character with 3 strokes can, for example, have the model $\langle \Gamma_{\mathrm{line}}, \Gamma_{\mathrm{line}}, \Gamma_{\mathrm{curve}} \rangle$, indicating that the first 2 strokes are modeled as straight lines and the 3rd stroke as a curve segment.
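A class model of this kind maps naturally onto a tuple of part-model names, and the dimensionality of the hidden vector is the sum of the part dimensionalities. In the sketch below the per-part dimensions (4 for a line, 6 for a curve) are hypothetical values chosen only to make the arithmetic visible.

```python
# Hypothetical dimensionalities of each part model's hidden vector.
PART_DIM = {"line": 4, "curve": 6}

def hidden_dim(class_model):
    """Total dimension of Z for a k-tuple of part models."""
    return sum(PART_DIM[part] for part in class_model)

# A 3-stroke character: two straight strokes followed by one curved stroke.
gamma_c = ("line", "line", "curve")
print(hidden_dim(gamma_c))  # 4 + 4 + 6 = 14
```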
Consider now a binary classification task between two object classes $C_0$ and $C_1$. Let $k_0$ and $k_1$ denote the number of parts in each object class respectively. The space of possible models for this particular task, $\mathcal{M}_{C_0,C_1}$, then consists of all possible model combinations for $C_0$ and $C_1$, that is,
$$\mathcal{M}_{C_0,C_1} = \{\langle \Gamma_{C_0}, \Gamma_{C_1} \rangle\}$$
where $\Gamma_{C_0}$ and $\Gamma_{C_1}$ are a $k_0$-tuple and a $k_1$-tuple describing the respective choices of part models for $C_0$ and $C_1$.
By decomposing each object into parts and allowing the possibility of modeling each part differently, we have introduced a structure within the model space. By taking advantage of this structure, we can perform adaptive model refinement that does not require the system to evaluate every possible model and can therefore entertain a much richer space of models.
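With a finite set of part models, the task model space is a Cartesian product and can be enumerated explicitly for small problems. The sketch below assumes the two-element part-model set from the stroke example and hypothetical part counts of 3 and 2.

```python
from itertools import product

PART_MODELS = ("line", "curve")   # the part-model set from the stroke example

def class_models(k):
    """All k-tuples of part models for a class with k parts."""
    return list(product(PART_MODELS, repeat=k))

def model_space(k0, k1):
    """All pairs (model for C0, model for C1)."""
    return list(product(class_models(k0), class_models(k1)))

space = model_space(3, 2)
print(len(space))   # 2**(3 + 2) = 32
print(space[0])     # (('line', 'line', 'line'), ('line', 'line'))
```

The size grows exponentially in the total number of parts, which is why exhaustive evaluation quickly becomes impractical and a structured search is attractive.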
3.2.3 Model Space and Well-Formed Concepts
Each member of $\mathcal{M}_{C_0,C_1}$ (i.e. an instance of $\langle \Gamma_{C_0}, \Gamma_{C_1} \rangle$) can be seen as representing a well-formed concept, as discussed in Chapter 2. Each pair of distributions (over $Z$) for classes $C_0$ and $C_1$ defines an implicit decision boundary, which corresponds to the Bayes optimal decision for the task if these were the true task distributions.
From a Bayesian point of view, the search for the optimal model can be seen as optimizing a posterior on the model space:
$$\Pr(M \mid z) \propto \Pr(z \mid M)\,\Pr(M)$$
where $M$ represents a model in $\mathcal{M}$ and $z$ is the training data.
Since this is a classification task, the data likelihood $\Pr(z \mid M)$ should be based on the empirical classification loss and not on the data modeling probability within the individual classes. For example, $\Pr(z \mid M) = \prod_i \Pr(y_i \mid x_i, M)$ rather than $\Pr(z \mid M) = \prod_i \Pr(x_i, y_i \mid M)$. In our algorithm, we directly estimate $\Pr(z \mid M)$ using the empirical error rate of a classifier trained using the augmented data set.
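The difference between the two likelihoods can be illustrated on toy data. The sketch below assumes one-dimensional Gaussian class models with uniform priors and, as a simplification, scores the plug-in classifier implied by the model itself rather than a separately trained discriminative learner; the data and parameters are hypothetical.

```python
import math

def gauss_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def score(data, model):
    """Return (conditional log-likelihood, joint log-likelihood, error rate).

    model maps each label to hypothetical (mean, std) parameters; class priors
    are assumed uniform.
    """
    cond = joint = 0.0
    errors = 0
    for x, y in data:
        joint_xy = {c: 0.5 * gauss_pdf(x, *params) for c, params in model.items()}
        joint += math.log(joint_xy[y])                           # log Pr(x_i, y_i | M)
        cond += math.log(joint_xy[y] / sum(joint_xy.values()))   # log Pr(y_i | x_i, M)
        errors += max(joint_xy, key=joint_xy.get) != y           # plug-in decision
    return cond, joint, errors / len(data)

# Hypothetical labelled data; the third point lies on the wrong side of x = 1.
data = [(-0.2, -1), (0.3, -1), (1.2, -1), (1.8, 1), (2.4, 1)]
model = {-1: (0.0, 1.0), 1: (2.0, 1.0)}

cond_ll, joint_ll, err = score(data, model)
print(cond_ll, joint_ll, err)
```

The conditional likelihood and the error rate respond only to how well the labels are predicted, whereas the joint likelihood also rewards fitting the input density, which is not the objective of the classification task.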
For Pr(M), it is natural to assign less probability to models that are more complex. Since, by design, we can only reach a more complex model by refining a current one (e.g. by refining the model for a specific object part), we can perform a best-first search on the model space based on the estimated Pr(z|M) alone without the need to evaluate all possible models. Note also that this best-first search is consistent with the general search strategy that we outlined in Chapter 2.
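A best-first search driven by refinement might be sketched as follows. Everything here is an illustrative assumption: a model is flattened to a single tuple of part models, a line is the only refinable part, and `toy_error` is a stand-in for the empirical error of a classifier trained on the augmented data set.

```python
import heapq

REFINEMENT = {"line": "curve"}   # hypothetical: a line part can be refined into a curve

def refinements(model):
    """Yield the models reachable by refining exactly one part."""
    for i, part in enumerate(model):
        if part in REFINEMENT:
            yield model[:i] + (REFINEMENT[part],) + model[i + 1:]

def best_first_search(start, estimated_error, budget):
    """Expand the lowest-error model first, evaluating at most `budget` models."""
    best = (estimated_error(start), start)
    frontier = [best]
    seen = {start}
    evaluated = 1
    while frontier and evaluated < budget:
        _, model = heapq.heappop(frontier)
        for child in refinements(model):
            if child in seen or evaluated >= budget:
                continue
            seen.add(child)
            child_err = estimated_error(child)
            evaluated += 1
            heapq.heappush(frontier, (child_err, child))
            if child_err < best[0]:
                best = (child_err, child)
    return best

# Toy error: 0.05 plus 0.1 per part that differs from a hidden "right" model.
TARGET = ("line", "curve", "line")
def toy_error(model):
    return 0.05 + 0.1 * sum(a != b for a, b in zip(model, TARGET))

err, model = best_first_search(("line", "line", "line"), toy_error, budget=6)
print(model)  # ('line', 'curve', 'line')
```

Starting from the simplest model and only ever refining means that simpler models are visited first, so the preference for low complexity is built into the search order and only the estimated error needs to be computed.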