Chapter 3 Model Construction for Phantom Examples

3.2 Analysis

3.2.1 Domain Knowledge Bias

Consider a binary classification task where, given the input $x \in \mathcal{X}$, we try to predict the label $y \in \{-1, 1\}$, assuming that there exists an underlying joint distribution $D$ for $(x, y) \in \mathcal{X} \times \mathcal{Y}$ which is unknown. When training a discriminative learner, we present the learner with labeled examples from two different classes, sampled from their respective underlying true distributions $p_0(X) = \Pr_D(X \mid Y = -1)$ and $p_1(X) = \Pr_D(X \mid Y = 1)$.

We can think of communicating the distributions $p_0(X)$ and $p_1(X)$ to the learner through the training examples. However, for a finite training set, the empirical distribution may be inadequate for the learner to come up with a good decision boundary.

We augment the training set by generating phantom examples drawn from class-specific generative models with parameters $\theta = (\theta_0, \theta_1)$, where $\theta_0 \in \Theta_0$ describes the model for class $-1$ and $\theta_1 \in \Theta_1$ the model for class $1$. The corresponding distributions are denoted $q_0^{\theta_0}(X) = \Pr_{\theta_0}(X \mid Y = -1)$ and $q_1^{\theta_1}(X) = \Pr_{\theta_1}(X \mid Y = 1)$ respectively. Both $\theta_0$ and $\theta_1$ are obtained by fitting the parameters to the real training set.
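The augmentation step can be illustrated with a minimal sketch. It assumes, purely for illustration, one-dimensional inputs and a Gaussian generative model per class; the names `fit_gaussian` and `augment_with_phantoms` are hypothetical and do not come from the text.

```python
import random
import statistics

def fit_gaussian(samples):
    """Fit a 1-D Gaussian (mean, std) to the real examples of one class."""
    return statistics.fmean(samples), statistics.stdev(samples)

def augment_with_phantoms(real, m_per_class, rng):
    """real: list of (x, y) pairs with y in {-1, 1}.
    Returns the real examples followed by m_per_class phantom examples per class."""
    phantoms = []
    for label in (-1, 1):
        xs = [x for x, y in real if y == label]
        mu, sigma = fit_gaussian(xs)  # plays the role of theta_0 or theta_1
        phantoms += [(rng.gauss(mu, sigma), label) for _ in range(m_per_class)]
    return real + phantoms

rng = random.Random(0)
# Toy "real" training set: 20 examples per class (assumed distributions).
real = [(rng.gauss(-1.0, 1.0), -1) for _ in range(20)] + \
       [(rng.gauss(1.0, 1.0), 1) for _ in range(20)]
augmented = augment_with_phantoms(real, m_per_class=100, rng=rng)
print(len(real), len(augmented))
```

The augmented list would then be handed to whatever discriminative learner is in use; the Gaussian family here is only a stand-in for a domain-specific generative model.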

For complex, real-world problems, it is extremely unlikely that our space of generative models contains the true distributions $p_0$ and $p_1$. However, we expect that $q_0$ and $q_1$ will be closer to the true distributions when tuned with more real examples. Using KL-divergence as a measure of closeness, we define the bias of the trained generative models as follows:

$$\mathrm{Bias}(\theta_0, \theta_1) = \mathrm{KL}(p_0 \,\|\, q_0^{\theta_0}) + \mathrm{KL}(p_1 \,\|\, q_1^{\theta_1})$$

where KL denotes the KL-divergence.
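As a worked illustration of this definition, the sketch below computes the bias for small discrete distributions. The probability tables are made-up values, and `kl_divergence` and `bias` are hypothetical helper names.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions given as equal-length lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf  # p puts mass where q has none
            total += pi * math.log(pi / qi)
    return total

def bias(p0, q0, p1, q1):
    """Bias = KL(p0 || q0) + KL(p1 || q1)."""
    return kl_divergence(p0, q0) + kl_divergence(p1, q1)

# Made-up true distributions and fitted models over three outcomes.
p0, p1 = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]
q0, q1 = [0.6, 0.3, 0.1], [0.2, 0.3, 0.5]

print(bias(p0, q0, p1, q1))  # strictly positive: the models are imperfect
print(bias(p0, p0, p1, p1))  # 0.0: models that match the truth have no bias
```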

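Since the KL-divergence is nonnegative, there exists a minimal bias

$$\varepsilon^*_\theta = \inf_{\theta_0, \theta_1} \mathrm{Bias}(\theta_0, \theta_1).$$

We expect that, with more real examples, the resulting models $\theta_0$ and $\theta_1$ will have bias $\varepsilon_{\theta_0,\theta_1}$ closer to $\varepsilon^*_\theta$.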

The bias provides an upper bound to the deviation from the optimal Bayes error rate when distributions $q_0$ and $q_1$ are used to classify test examples which are drawn according to $p_0$ and $p_1$. This is given by:

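Proposition 1. Assume that the class prior for each class is uniform (i.e. $\Pr(Y = -1) = \Pr(Y = 1) = 1/2$). Let $\varepsilon^*$ be the optimal Bayes error for examples drawn according to $p_0$ and $p_1$. The expected classification error when $q_0^{\theta_0}$ and $q_1^{\theta_1}$ are used to make decisions (instead of $p_0$ and $p_1$), denoted $\varepsilon_{\theta_0,\theta_1}$, is upper-bounded by:

$$\varepsilon_{\theta_0,\theta_1} \le \varepsilon^* + \sqrt{\frac{\mathrm{Bias}(\theta_0, \theta_1)}{2}} \tag{3.1}$$

Proof.

$$\begin{aligned}
\varepsilon_{\theta_0,\theta_1} - \varepsilon^* &\le \sum_X \left( \frac{1}{2}\,|p_0(X) - q_0(X)| + \frac{1}{2}\,|p_1(X) - q_1(X)| \right) \\
&\le \frac{1}{2}\left( \sqrt{2\,\mathrm{KL}(p_0 \,\|\, q_0)} + \sqrt{2\,\mathrm{KL}(p_1 \,\|\, q_1)} \right) \\
&\le \sqrt{\frac{\mathrm{KL}(p_0 \,\|\, q_0) + \mathrm{KL}(p_1 \,\|\, q_1)}{2}}
\end{aligned}$$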

where the first inequality is due to [Devroye et al., 1996] and the second inequality employs Pinsker's inequality.
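The first two steps of the proof can be checked numerically on a small discrete example. The sketch below is only an illustration under assumed, made-up probability tables and uniform class priors; the function names are hypothetical. It compares the excess error of the plug-in rule (predict class 1 where $q_1 > q_0$) with the $L_1$ bound and with the bound obtained from Pinsker's inequality.

```python
import math

def kl(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def bayes_error(p0, p1):
    """Optimal error under uniform priors: always pick the larger true density."""
    return sum(0.5 * min(a, b) for a, b in zip(p0, p1))

def plug_in_error(p0, p1, q0, q1):
    """True error of the rule 'predict 1 iff q1(x) > q0(x)' under uniform priors."""
    err = 0.0
    for a, b, qa, qb in zip(p0, p1, q0, q1):
        predict_pos = qb > qa
        err += 0.5 * (a if predict_pos else b)  # mass of the class we get wrong
    return err

# Made-up true distributions and fitted models over three outcomes.
p0, p1 = [0.5, 0.3, 0.2], [0.2, 0.35, 0.45]
q0, q1 = [0.5, 0.35, 0.15], [0.25, 0.3, 0.45]

gap = plug_in_error(p0, p1, q0, q1) - bayes_error(p0, p1)
l1_bound = sum(0.5 * abs(a - qa) + 0.5 * abs(b - qb)
               for a, b, qa, qb in zip(p0, p1, q0, q1))
pinsker_bound = 0.5 * (math.sqrt(2 * kl(p0, q0)) + math.sqrt(2 * kl(p1, q1)))

print(gap, l1_bound, pinsker_bound)
```

In this example the models disagree with the Bayes rule on the middle outcome only, so the excess error is small relative to both bounds.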

Proposition 1 tells us that as long as the generative models are close enough to the true models, the achievable error rate will be close to the true optimal. Depending on the problem, it may or may not attain the true optimal, and this largely depends on the amount of information our generative models can provide regarding the decision boundary of the problem.

Assume that the space of classifiers learnable by our discriminative learner includes those with performance arbitrarily close to the optimal Bayes error. Let $\varepsilon_{N,M,\tilde{N}}$ be the expected error of the resulting classifier when $N$ real examples and $M$ phantom examples are used to train the discriminative learner, where the phantom examples are generated using a model tuned with $\tilde{N}$ real examples ($\tilde{N}$ and $N$ will usually be the same, i.e. all real examples participate in tuning the generative models as well as training the final classifier). Let $\varepsilon^*$ be the best possible error rate (Bayes optimal error) for the particular problem distribution $D$. We expect that $\varepsilon_{N,M,\tilde{N}} \to \varepsilon^*$ as the num-

learner is used.

The more interesting question concerns the expected performance when we increase the number of phantom examples used in the training of the discriminative learner. This largely depends on the quality of the generative models used, which in turn depends on the quality of the prior domain knowledge. Regardless of this, we expect that $\varepsilon_{N,M,\tilde{N}} \to \varepsilon^*_\theta$ when both $\tilde{N}, M \to \infty$ (holding $N$ fixed).

As the phantom set grows, we can expect the amount of information that it extracts from the generative models to reach an asymptote (the corresponding error rate would be $\varepsilon^*_\theta$). If the generative models are helpful (good domain knowledge), the effect will be positive; if the generative models are harmful (bad domain knowledge), the effect will be negative.

Another advantage of using phantom examples, in addition to the improved classification accuracy, is robustness to classification noise. With the domain knowledge, the classifier is learned not only on the noisy distribution given by the training examples, but also on a distribution biased toward one justified by the domain theory (e.g. the generative models for the hidden strokes), making the classifier less likely to fit the noise. When tested on (noise-free) test data, the classifier is then more likely to reveal the true labeling of the examples rather than the noisy labeling.

3.2.2 Model Space

Since we tune our generative models using only a finite number of real examples, they are susceptible to overfitting. This in turn will affect the quality of the actual phantom examples generated. We would like our system to adaptively select a model that achieves a balance between good fit and model complexity.

Instead of using an a priori fixed generative model for each class of objects, we allow a (possibly infinite) space $\mathcal{M}$ of different generative models, each of which may have a different complexity and a different parameterization. As mentioned earlier, learning in the space of models can be greatly facilitated by having a structure over the model space $\mathcal{M}$. The structure should ideally allow a local search within model classes, where models of similar complexities are close to each other. For this work, we assume that each object class can be decomposed into conceptual parts. The model space is then constructed based on the models for individual parts.

We first define the variables involved in any particular model. We use $\mathcal{X}$ to denote the input space (for example, images) and $X$ the corresponding random variable that represents an input instance, where $X$ takes values from $\mathcal{X}$. It is often the case that $\mathcal{X}$ is a high-dimensional space where each individual component contains very little information about the class label (e.g. pixels in an image). Modeling a distribution over $\mathcal{X}$ is therefore undesirable. With prior domain knowledge, however, we can usually identify a higher-level, low-dimensional space of "hidden" variables (such as properties of strokes in a character), denoted here as $Z$, which takes values from $\mathcal{Z}$. We assume that there is a function $\phi : \mathcal{Z} \to \mathcal{X}$ that converts each instance of $Z$ into a corresponding input-space instance $X = \phi(Z)$.

The variable Z can be further decomposed as:

$$Z = (Z_{\Gamma_1}, Z_{\Gamma_2}, \ldots, Z_{\Gamma_k})$$

where each $\Gamma_i$ ($i = 1, \ldots, k$) denotes a model for part $i$ of the object class/category. Each model $\Gamma_i$ is represented by a $d(\Gamma_i)$-dimensional random vector, which we denote as

$$Z_{\Gamma_i} = \left(Z^{1}_{\Gamma_i}, Z^{2}_{\Gamma_i}, \ldots, Z^{d(\Gamma_i)}_{\Gamma_i}\right), \quad i = 1, \ldots, k.$$

Therefore, $Z$ is a $\tilde{d}$-dimensional random vector where $\tilde{d} = \sum_{i=1}^{k} d(\Gamma_i)$.

The model $\Gamma_i$ for each object part $i$ can be chosen differently from a set of possible models $\Gamma^*$. For example, when modeling strokes in a handwritten character, each stroke can either be modeled as a straight line segment or a curve segment. In this case, let $\Gamma^* = \{\Gamma_{\text{line}}, \Gamma_{\text{curve}}\}$. Then each $\Gamma_i$ can either be $\Gamma_{\text{line}}$ or $\Gamma_{\text{curve}}$.

The complete generative model for a particular object class $C$ that consists of $k$ object parts can be fully specified by selecting an object model for each part. We use a $k$-tuple $\Gamma^{C} = \langle \Gamma_1, \ldots, \Gamma_k \rangle$ to represent this selection. Again, using the stroke models above, a character with 3 strokes, for example, can have a particular model $\langle \Gamma_{\text{line}}, \Gamma_{\text{line}}, \Gamma_{\text{curve}} \rangle$, indicating that the first 2 strokes are modeled as straight lines, while the 3rd stroke is modeled as a curve segment.
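A class model of this kind maps naturally onto a tuple of part models. The sketch below is a minimal illustration; the `PartModel` type and the dimensions assigned to the line and curve models are assumptions made for the example (the text does not specify them). It computes the total hidden dimension as the sum of the part dimensions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PartModel:
    name: str
    dim: int  # d(Gamma): number of hidden variables describing this part

# Hypothetical dimensions: a line as two 2-D endpoints, a curve as three 2-D control points.
LINE = PartModel("line", 4)
CURVE = PartModel("curve", 6)

def hidden_dim(class_model):
    """Total dimension of Z for a class model given as a tuple of part models."""
    return sum(part.dim for part in class_model)

# A three-stroke character: two straight strokes followed by one curved stroke.
gamma_c = (LINE, LINE, CURVE)
print(hidden_dim(gamma_c))  # 4 + 4 + 6 = 14
```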

Consider now a binary classification task between two object classes $C_0$ and $C_1$. Let $k_0$ and $k_1$ denote the number of parts in each object class respectively. Then the space of possible models for this particular task, $\mathcal{M}^{C_0,C_1}$, consists of all possible model combinations for $C_0$ and $C_1$, that is,

$$\mathcal{M}^{C_0,C_1} = \left\{ \langle \Gamma^{C_0}, \Gamma^{C_1} \rangle \right\}$$

where $\Gamma^{C_0}$ and $\Gamma^{C_1}$ are a $k_0$-tuple and a $k_1$-tuple describing the respective choices of part models for $C_0$ and $C_1$.

By decomposing each object into parts and allowing the possibility of modeling each part differently, we have introduced a structure within the model space. By taking advantage of this structure, we can perform adaptive model refinement that does not require the system to evaluate every possible model and can therefore entertain a much richer space of models.
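To give a sense of how quickly this space grows, the following sketch enumerates it for the two-element part-model set (line or curve) used in the stroke example. The function names are hypothetical, and part models are represented simply as strings.

```python
from itertools import product

PART_MODELS = ("line", "curve")  # the set of part models from the stroke example

def class_models(k):
    """All k-tuples of part models for a class with k parts."""
    return list(product(PART_MODELS, repeat=k))

def model_space(k0, k1):
    """All pairs (model for C0, model for C1)."""
    return list(product(class_models(k0), class_models(k1)))

space = model_space(2, 3)
print(len(space))  # 2**2 * 2**3 = 32
print(space[0])    # (('line', 'line'), ('line', 'line', 'line'))
```

With two choices per part the space has $2^{k_0 + k_1}$ members, so exhaustive evaluation becomes impractical as the number of parts or part models grows, which is the motivation for searching it locally.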

3.2.3 Model Space and Well-Formed Concepts

Each member of $\mathcal{M}^{C_0,C_1}$ (i.e. an instance of $\langle \Gamma^{C_0}, \Gamma^{C_1} \rangle$) can be seen as representing a well-formed concept discussed in Chapter 2. Each pair of distributions (over $Z$) for class $C_0$ and $C_1$ defines an implicit decision boundary, which corresponds to the Bayes optimal decision for the task if these were the true task distributions.

From a Bayesian point of view, the search for the optimal model can be seen as optimizing a posterior on the model space:

Pr(M|z) ∝ Pr(z|M)Pr(M)

where $M$ represents a model in $\mathcal{M}$ and $z$ is the training data.

Since this is a classification task, the data likelihood, $\Pr(z \mid M)$, should be based on the empirical classification loss and not the data modeling probability within the individual class. For example, $\Pr(z \mid M) = \prod_i \Pr(y_i \mid x_i, M)$ but not $\Pr(z \mid M) = \prod_i \Pr(x_i, y_i \mid M)$. In our algorithm, we directly estimate $\Pr(z \mid M)$ using the empirical error rate of a classifier trained using the augmented data set.
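The difference between the two likelihoods can be made concrete with a small sketch. It assumes discrete inputs, uniform class priors, and made-up class-conditional tables `q0` and `q1`; it illustrates the distinction only and is not the error-rate estimate that the algorithm itself uses.

```python
import math

def conditional_log_likelihood(data, q0, q1):
    """Sum of log Pr(y | x, M), with Pr(y | x) from Bayes' rule under uniform priors."""
    total = 0.0
    for x, y in data:
        qy = q1[x] if y == 1 else q0[x]
        total += math.log(qy / (q0[x] + q1[x]))
    return total

def joint_log_likelihood(data, q0, q1):
    """Sum of log Pr(x, y | M) = log(0.5 * q_y(x)) under uniform priors."""
    return sum(math.log(0.5 * (q1[x] if y == 1 else q0[x])) for x, y in data)

# Made-up class-conditional models over outcomes 0, 1, 2 and a tiny labeled sample.
q0 = [0.6, 0.3, 0.1]
q1 = [0.1, 0.3, 0.6]
data = [(0, -1), (2, 1), (1, 1)]

print(conditional_log_likelihood(data, q0, q1))
print(joint_log_likelihood(data, q0, q1))
```

The conditional form scores only how well the labels are predicted given the inputs, whereas the joint form also penalizes the model for how it distributes mass over the inputs themselves.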

For Pr(M), it is natural to assign less probability to models that are more complex. Since, by design, we can only reach a more complex model by refining a current one (e.g. by refining the model for a specific object part), we can perform a best-first search on the model space based on the estimated Pr(z|M) alone without the need to evaluate all possible models. Note also that this best-first search is consistent with the general search strategy that we outlined in Chapter 2.
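A best-first search of this kind might be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in this work: models are tuples of the strings "line" and "curve", a refinement turns one line part into a curve, and `estimate_error` is a hypothetical callback standing in for training a classifier on the augmented data and measuring its empirical error.

```python
import heapq

def refinements(model):
    """Models reachable by making exactly one part more complex (line -> curve)."""
    for i, part in enumerate(model):
        if part == "line":
            yield model[:i] + ("curve",) + model[i + 1:]

def best_first_search(start, estimate_error, max_evals):
    """Expand the lowest-error model first; stop after max_evals evaluations."""
    best_model, best_err = start, estimate_error(start)
    frontier = [(best_err, start)]  # min-heap keyed on estimated error
    seen = {start}
    evals = 1
    while frontier and evals < max_evals:
        _, model = heapq.heappop(frontier)
        for child in refinements(model):
            if child in seen or evals >= max_evals:
                continue
            seen.add(child)
            child_err = estimate_error(child)
            evals += 1
            heapq.heappush(frontier, (child_err, child))
            if child_err < best_err:
                best_model, best_err = child, child_err
    return best_model, best_err

# Toy error function (an assumption): 0.1 plus 0.05 per part that differs from a hidden target.
TARGET = ("line", "curve", "line")
def toy_error(model):
    return 0.1 + 0.05 * sum(a != b for a, b in zip(model, TARGET))

print(best_first_search(("line", "line", "line"), toy_error, max_evals=5))
```

Starting from the simplest model and only ever refining means that more complex models are reached later, which loosely plays the role of a prior that favors simplicity without requiring it to be evaluated explicitly.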
