Undergenerating Approximations - Approximate Inference Theory

5.3 Approximate Inference Theory

5.3.1 Undergenerating Approximations

Undergenerating methods approximate $\operatorname{argmax}_{y \in \mathcal{Y}}$ by $\operatorname{argmax}_{y \in \underline{\mathcal{Y}}}$, where $\underline{\mathcal{Y}} \subseteq \mathcal{Y}$; in more conventional language, the maximizing $y$ that they return may not be a global maximum in $\mathcal{Y}$. In Algorithm 1, because the separation oracle searches only a subset of the constraints, at termination there may still remain constraints in OP 4 that are violated by more than $\epsilon$. Use of such a method in the separation oracle may therefore result in a quadratic problem that is underconstrained with respect to the true optimal solution of OP 4. In supervised correlation clustering we had the greedy approximation of Section 3.4.1, and in supervised k-means clustering the iterative point-incremental and discretized spectral clustering methods of Section 4.3.3.

For Markov random fields (MRFs), we consider the following undergenerating methods:

Greedy iteratively changes the single variable value $y_u$ that would increase network potential most.

Combine picks the assignment $y$ with the highest network potential from the outputs of greedy and loopy belief propagation (LBP).
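The greedy method can be sketched on a tiny pairwise network. This is a minimal illustration, not the implementation used in this work: the representation (a list of per-node potentials `unary` and a dictionary of edge tables `pairwise`), the function names, and all numbers are assumptions made for the example. A brute-force maximizer is included only to show that greedy can stop at a local optimum.

```python
from itertools import product

def network_potential(y, unary, pairwise):
    """Sum of node potentials and edge potentials for assignment y."""
    total = sum(unary[u][y[u]] for u in range(len(y)))
    total += sum(table[y[u]][y[v]] for (u, v), table in pairwise.items())
    return total

def greedy_inference(unary, pairwise, num_labels, y_init=None):
    """Repeatedly apply the single-variable change with the largest gain."""
    y = list(y_init) if y_init is not None else [0] * len(unary)
    current = network_potential(y, unary, pairwise)
    while True:
        best_gain, best_move = 0.0, None
        for u in range(len(y)):
            old = y[u]
            for label in range(num_labels):
                if label == old:
                    continue
                y[u] = label  # try the change, then undo it below
                gain = network_potential(y, unary, pairwise) - current
                if gain > best_gain:
                    best_gain, best_move = gain, (u, label)
            y[u] = old
        if best_move is None:  # no single change improves the potential
            return y, current
        u, label = best_move
        y[u] = label
        current = network_potential(y, unary, pairwise)

def exact_inference(unary, pairwise, num_labels):
    """Brute-force argmax; only feasible for tiny networks."""
    best = max(product(range(num_labels), repeat=len(unary)),
               key=lambda y: network_potential(y, unary, pairwise))
    return list(best), network_potential(best, unary, pairwise)

# Hypothetical 3-node chain with 2 labels; edges reward equal labels.
unary = [[0.0, 1.0], [0.5, 0.0], [0.0, 0.2]]
pairwise = {(0, 1): [[2.0, 0.0], [0.0, 2.0]],
            (1, 2): [[2.0, 0.0], [0.0, 2.0]]}

y_greedy, p_greedy = greedy_inference(unary, pairwise, 2)
y_exact, p_exact = exact_inference(unary, pairwise, 2)
print(y_greedy, p_greedy)
print(y_exact, p_exact)
```

Started from the all-zeros assignment, greedy halts immediately in this example because every single-variable change lowers the potential, while the exact maximizer flips all three variables at once. This is the undergenerating behavior described above: the returned assignment is a valid member of the output space but not necessarily its global maximum.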

We now theoretically characterize undergenerating learning and prediction. All theorems generalize to any learning problem, not just MRFs. Due to space constraints, provided proofs are proof skeletons.

Since undergenerating approximations can be arbitrarily poor, we must restrict our consideration to a subclass of undergenerating approximations to make meaningful theoretical statements. This analysis focuses on ρ-approximation algorithms, with $\rho \in (0, 1]$. What is a ρ-approximation? In our case, for predictive inference, if $y^* = \operatorname{argmax}_y \langle w, \Psi(x, y) \rangle$ is the true optimum and $y'$ is the ρ-approximation output, then

$\rho \cdot \langle w, \Psi(x, y^*) \rangle \le \langle w, \Psi(x, y') \rangle$  (5.2)

Similarly, for our separation oracle, with $y^* = \operatorname{argmax}_y \langle w, \Psi(x, y) \rangle + \Delta(y_i, y)$ as the true optimum, if $y'$ corresponds to the constraint found by our ρ-approximation, we know

$\rho \left[ \langle w, \Psi(x, y^*) \rangle + \Delta(y_i, y^*) \right] \le \langle w, \Psi(x, y') \rangle + \Delta(y_i, y')$  (5.3)
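The two conditions (5.2) and (5.3) are simple to check numerically once the discriminant values and losses are known. The sketch below is illustrative only: the function names are hypothetical, and the inputs are plain numbers standing in for the inner products and losses of the exact and approximate outputs.

```python
def satisfies_rho_prediction(rho, score_opt, score_approx):
    """Inequality (5.2): rho * <w, Psi(x, y*)> <= <w, Psi(x, y')>."""
    return rho * score_opt <= score_approx

def satisfies_rho_separation(rho, score_opt, loss_opt, score_approx, loss_approx):
    """Inequality (5.3): the same test on loss-augmented values."""
    return rho * (score_opt + loss_opt) <= score_approx + loss_approx

def largest_valid_rho(value_opt, value_approx):
    """Largest rho in (0, 1] for which the bound holds on one instance.

    Assumes 0 < value_approx <= value_opt (an assumption of this sketch).
    """
    if not (0 < value_approx <= value_opt):
        raise ValueError("expected 0 < value_approx <= value_opt")
    return value_approx / value_opt

# Made-up values: the exact optimum scores 8.0, the approximation 6.0.
print(satisfies_rho_prediction(0.5, 8.0, 6.0))   # 4.0 <= 6.0
print(satisfies_rho_prediction(0.9, 8.0, 6.0))   # 7.2 <= 6.0 fails
print(largest_valid_rho(8.0, 6.0))
```

Note that a ratio measured on one instance says nothing about the worst case; a ρ-approximation algorithm must satisfy the inequality on every input.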

For simplicity, this analysis supposes $S$ contains exactly one training example $(x_0, y_0)$. However, this is easily generalized: one may view $n$ training examples as a single example whose inference consists of $n$ separate processes with combined outputs. In a similar fashion, combined ρ-approximation outputs may be viewed as a single ρ-approximation output. Further, this practice of effectively combining multiple examples into one example reflects the actual implementation of the structural SVM [53].
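The statement that combined ρ-approximation outputs form a single ρ-approximation output follows by summing the per-example bounds. A short sketch for the separation oracle, where the shorthand $y_i^*$ and $y_i'$ for the exact and approximate outputs on example $i$ is notation introduced here for illustration:

```latex
\rho \left[ \langle w, \Psi(x_i, y_i^*) \rangle + \Delta(y_i, y_i^*) \right]
  \le \langle w, \Psi(x_i, y_i') \rangle + \Delta(y_i, y_i')
  \quad \text{for each } i = 1, \dots, n
\;\Longrightarrow\;
\rho \sum_{i=1}^{n} \left[ \langle w, \Psi(x_i, y_i^*) \rangle + \Delta(y_i, y_i^*) \right]
  \le \sum_{i=1}^{n} \left[ \langle w, \Psi(x_i, y_i') \rangle + \Delta(y_i, y_i') \right]
```

Because the $n$ inference processes are separate, the sum on the left is the optimum of the combined problem (the maximum of a sum of independent terms is the sum of the maxima), so the combined output satisfies the same form of bound as in (5.3).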

$\bar{\Delta} = \max_{i, y \in \mathcal{Y}} \|\Delta(y_i, y)\|$ are finite, an undergenerating learner terminates after adding at most $\epsilon^{-2}(C \bar{\Delta}^2 \bar{R}^2 + n \bar{\Delta})$ constraints.

Proof. The original proof holds as it does not depend upon separation oracle quality (Algorithm 1, line 7).

Lemma 1. After line 6 in Algorithm 1, let $w$ be the current model, $\hat{y}$ the constraint found with the ρ-approximation separation oracle, and $\hat{\xi} = H(\hat{y})$ the slack associated with $\hat{y}$. Then $w$ and slack $\hat{\xi} + \frac{1 - \rho}{\rho} \left[ \langle w, \Psi(x_0, \hat{y}) \rangle + \Delta(y_0, \hat{y}) \right]$ are feasible in OP 4.

Proof. To outline the proof idea, if we knew the true most violated constraint y∗, we would know the minimum ξ∗ such that w, ξ∗ was feasible in OP 4. The proof upper bounds ξ∗.

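With a ρ-approximation algorithm as our separation oracle, instead of solving $y^* = \operatorname{argmax}_{y \in \mathcal{Y}} \Delta(y_0, y) + \langle w, \Psi(x_0, y) \rangle$ exactly, we find some $\hat{y}$ such that

$\Delta(y_0, \hat{y}) + \langle w, \Psi(x_0, \hat{y}) \rangle \ge \rho \left[ \Delta(y_0, y^*) + \langle w, \Psi(x_0, y^*) \rangle \right]$  (5.4)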

Since we did not solve $\operatorname{argmax}_y H(y)$ exactly, we have not necessarily found the most violated constraint. In fact, we have underestimated the slack required to make the current model $w$ feasible under OP 4 by exactly the amount

$\left[ \Delta(y_0, y^*) + \langle w, \Psi(x_0, y^*) \rangle \right] - \left[ \Delta(y_0, \hat{y}) + \langle w, \Psi(x_0, \hat{y}) \rangle \right]$  (5.5)

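The first term of (5.5) is unknown, but the ρ-approximation bound (5.4) limits how large it can be. Applying (5.4) to that term bounds the underestimate:

$$
\begin{aligned}
& \left[ \Delta(y_0, y^*) + \langle w, \Psi(x_0, y^*) \rangle \right] - \left[ \Delta(y_0, \hat{y}) + \langle w, \Psi(x_0, \hat{y}) \rangle \right] \\
&\quad \le \frac{1}{\rho} \left[ \Delta(y_0, \hat{y}) + \langle w, \Psi(x_0, \hat{y}) \rangle \right] - \left[ \Delta(y_0, \hat{y}) + \langle w, \Psi(x_0, \hat{y}) \rangle \right] \\
&\quad = \frac{1 - \rho}{\rho} \left[ \Delta(y_0, \hat{y}) + \langle w, \Psi(x_0, \hat{y}) \rangle \right]
\end{aligned}
$$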

So, we know that the true slack $\xi^*$ required for this example under $w$ obeys

$\xi^* \le \hat{\xi} + \frac{1 - \rho}{\rho} \left[ \Delta(y_0, \hat{y}) + \langle w, \Psi(x_0, \hat{y}) \rangle \right]$  (5.6)

Since $w$ is feasible under slack $\xi^*$, it must also be feasible under this upper bound.
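A small numeric check can make the bound (5.6) concrete. The sketch below is an illustration under assumed numbers: the function names are hypothetical, and each output is represented only by its discriminant value and its loss against the true label. The quantity `violation` follows the form of the slack used in this section, the loss-augmented value of an output minus the discriminant value of the true label.

```python
def violation(score_y, loss_y, score_true):
    """Loss-augmented value of y minus the discriminant value of the true label."""
    return loss_y + score_y - score_true

def padded_slack(rho, score_hat, loss_hat, score_true):
    """Right-hand side of (5.6): xi_hat + (1 - rho)/rho * [loss + score] at y_hat."""
    xi_hat = violation(score_hat, loss_hat, score_true)
    return xi_hat + (1.0 - rho) / rho * (loss_hat + score_hat)

# Made-up values. The true label has discriminant value 3.0.
score_true = 3.0
# The (unknown in practice) most violated constraint: score 4.0, loss 1.0.
xi_star = violation(4.0, 1.0, score_true)
# A 0.5-approximate oracle returns an output with score 2.5 and loss 0.5;
# its loss-augmented value 3.0 is at least 0.5 * 5.0, so (5.4) holds.
rho = 0.5
xi_hat = violation(2.5, 0.5, score_true)
bound = padded_slack(rho, 2.5, 0.5, score_true)
print(xi_hat, xi_star, bound)
```

Here the oracle's slack underestimates the true requirement, while the padded slack covers it, as the lemma states. The padding is computable from the oracle's output alone, which is what makes it usable when the exact maximizer is out of reach.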

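Theorem 9. When iteration ceases with the result $w, \xi$, if $\hat{y}$ was the last found most violated constraint, we know that the optimum objective function value $v^*$ for OP 4 lies in the interval

$\frac{1}{2} \|w\|^2 + C\xi \;\le\; v^* \;\le\; \frac{1}{2} \|w\|^2 + C \left[ \frac{1}{\rho} \left[ \langle w, \Psi(x_0, \hat{y}) \rangle + \Delta(y_0, \hat{y}) \right] - \langle w, \Psi(x_0, y_0) \rangle \right]$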

Proof. This is simply Lemma 1 applied to the last iteration.

So, even with ρ-approximate separation oracles, one may bound how far off a final solution is from solving OP 4. Sensibly, the better the approximation, i.e., as ρ approaches 1, the tighter the solution bound.
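The interval of Theorem 9 can be evaluated directly from quantities available at termination. The following is a minimal sketch with a hypothetical function name and made-up numbers; it only illustrates how the interval narrows as ρ approaches 1.

```python
def objective_interval(w_norm_sq, C, xi, rho, score_hat, loss_hat, score_true):
    """Lower and upper bounds on the OP 4 optimum v* from Theorem 9.

    w_norm_sq  : squared norm of the final model w
    xi         : final slack
    score_hat  : discriminant value of the last constraint found, y_hat
    loss_hat   : loss of y_hat against the true label
    score_true : discriminant value of the true label
    """
    lower = 0.5 * w_norm_sq + C * xi
    upper = 0.5 * w_norm_sq + C * ((score_hat + loss_hat) / rho - score_true)
    return lower, upper

# Made-up termination state: ||w||^2 = 2, C = 1, xi = 0.
loose = objective_interval(2.0, 1.0, 0.0, 0.5, 2.5, 0.5, 3.0)
tight = objective_interval(2.0, 1.0, 0.0, 1.0, 2.5, 0.5, 3.0)
print(loose)
print(tight)
```

With the same termination state, an exact oracle (ρ = 1) collapses the interval to a single value in this example, while ρ = 0.5 leaves a gap proportional to $C$.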

The next result concerns empirical risk. The SVM margin attempts to ensure that high-loss outputs have a low discriminant function value, and ρ-approximations produce outputs within a certain factor of optimum. As seen in Theorem 1, any $(w, \xi)$ solution to OP 4 that is feasible (and not even necessarily optimal) has a ξ-based upper bound on empirical risk, but only under the condition that $h(x) = \operatorname{argmax}_{y \in \mathcal{Y}} \langle w, \Psi(x, y) \rangle$, i.e., $h(x)$ returns the true maximizing argument rather than an approximation to this argmax. Recall that the proof of Theorem 1 depends upon the fact that if $\Delta(y_i, h(x_i)) > 0$, then it must be that $\langle w, \Psi(x_i, h(x_i)) \rangle > \langle w, \Psi(x_i, y_i) \rangle$, leading to the constraint associated with $h(x_i)$ requiring a greater slack. If $h(x_i)$ does not return a maximizing argument, this proof falls apart. However, if we suppose $h$ uses a ρ-approximate algorithm for inference, we can still say something about the resulting empirical risk.

Theorem 10. (ρ-Approximate Empirical Risk) For $w, \xi$ feasible in OP 4 from training with single example $(x_0, y_0)$, the empirical risk using ρ-approximate prediction has upper bound $(1 - \rho) \langle w, \Psi(x_0, y_0) \rangle + \xi$.

Proof. The idea of the proof is to take the OP 4 constraint associated with the output $y' = h(x_0)$, which we must be respecting if we have a feasible solution, then apply known bounds to the constraint's $\langle w, \Psi(x_0, y') \rangle$ term.

We have a single example $(x_0, y_0)$, with slack $\xi$. We know

$\Delta\left( y_0, \operatorname{argmax}_y \langle w, \Psi(x_0, y) \rangle \right) \le \xi$,  (5.7)

hence the claim that ξ upper bounds empirical risk. However, ξ upper bounds empirical risk only when our prediction function $h$ solves that argmax exactly. In general, the constraints in OP 4 tell us that for any output $y'$, a feasible solution $w, \xi$ satisfies

$\Delta(y_0, y') \le \langle w, \Psi(x_0, y_0) \rangle - \langle w, \Psi(x_0, y') \rangle + \xi$.  (5.8)

To illustrate the usefulness of this statement, first consider the "known separable" case, i.e., we have managed to find a feasible solution to OP 4 such that $\xi = 0$. In this case, it must be that for our training example $(x_0, y_0)$, $y_0$ is a maximizer, that is, $y_0$ is a valid solution for $\operatorname{argmax}_y \langle w, \Psi(x_0, y) \rangle$, and in the case where there are multiple optimizers, any such $\hat{y}$ must have $\Delta(y_0, \hat{y}) = 0$.

In the case where we have a ρ-approximator, whatever output $y'$ we find from this approximation must have $\langle w, \Psi(x_0, y') \rangle \ge \rho \langle w, \Psi(x_0, y_0) \rangle$, and consequently $\Delta(y_0, y') \le (1 - \rho) \langle w, \Psi(x_0, y_0) \rangle$. So, while ξ no longer necessarily bounds empirical risk when our predictor is a ρ-approximation, the scaling of the margin by the loss still allows us to say something useful about empirical risk.

The case where $\xi > 0$, the inseparable (or, more precisely, not provably separable) case, is a little more difficult to imagine, but the bound of (5.8) still holds. That quantity, however, is known only once we have made a prediction $y'$, with no information available a priori. With minimal fuss, we can produce a bound that does not depend on the prediction:

$\Delta(y_0, y') \le \langle w, \Psi(x_0, y_0) \rangle - \langle w, \Psi(x_0, y') \rangle + \xi$  (5.9)

$\le (1 - \rho) \langle w, \Psi(x_0, y_0) \rangle + \xi$  (5.10)

This last step relies upon

$\langle w, \Psi(x_0, y') \rangle \ge \langle w, \rho \Psi(x_0, y^*) \rangle$  (5.11)

$\ge \langle w, \rho \Psi(x_0, y_0) \rangle$  (5.12)

where $y^* = \operatorname{argmax}_{y \in \mathcal{Y}} \langle w, \Psi(x_0, y) \rangle$. In this way we see that the inseparable case is similar to the separable case. The theorem comes from (5.10).
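The bound of Theorem 10 can be checked on a small hand-built output space. Everything in the sketch below is an assumption made for illustration: the function names, the candidate names, and the numbers. Each candidate output is represented by a pair of its discriminant value and its loss against the true label.

```python
def empirical_risk_bound(rho, score_true, xi):
    """Theorem 10: (1 - rho) * <w, Psi(x0, y0)> + xi."""
    return (1.0 - rho) * score_true + xi

def is_feasible(candidates, score_true, xi):
    """Every OP 4 constraint holds: loss <= score_true - score + xi."""
    return all(loss <= score_true - score + xi
               for score, loss in candidates.values())

def rho_approximate_outputs(candidates, rho):
    """Names of outputs whose score is within a factor rho of the best score."""
    best = max(score for score, _ in candidates.values())
    return {name for name, (score, _) in candidates.items()
            if score >= rho * best}

# Hypothetical output space as name -> (score, loss); "y0" is the true label.
candidates = {"y0": (3.0, 0.0), "a": (2.5, 1.0), "b": (1.0, 2.0), "c": (3.25, 0.25)}
score_true, xi, rho = 3.0, 0.5, 0.75

feasible = is_feasible(candidates, score_true, xi)
allowed = rho_approximate_outputs(candidates, rho)
bound = empirical_risk_bound(rho, score_true, xi)
worst_loss = max(candidates[name][1] for name in allowed)
print(feasible, sorted(allowed), bound, worst_loss)
```

In this example every output that a 0.75-approximate predictor could return has loss below the bound. Output `b` has a larger loss than the bound, but its discriminant value is too low for it to be a valid ρ-approximate prediction, so the theorem does not need to cover it.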

If also using undergenerating ρ-approximate training, one may employ Theorem 9 to get a feasible ξ.
