5.5 Wishful Query Projection
5.5.2 Directed Mistake Volume Minimization (DMVM)
Next I introduce DMVM, an approximation of DEER. I then derive a performance guarantee for DMVM and show how it can be specified for an example uncertainty representation. Although I discuss DMVM further in Chapter 6, where I include it in some of the experiments, my focus will be on DEER, so the reader can skip this subsection without risk of missing the core content of the remainder of the dissertation.
DEER’s second step, which requires performing an entropy computation for every possible query, can be approximated by minimizing a geometric criterion that may be cheaper to compute than expected posterior response-entropy in some settings. In particular, Directed Mistake Volume Minimization (DMVM) approximates DEER’s second step as follows:
\[
q = \arg\min_{q \in Q} M(d^* \mid q),
\]
where
\[
M(d^* \mid q) = \int_{\Omega} \sum_{j=1}^{k} \Pr(d^* = j \mid \omega)\, \delta\big(J_{d^*}(\omega) \neq f(d^* \mid q = j)\big)\, d\omega,
\]
\[
J_{d^*}(\omega) = \arg\max_{j} \Pr(d^* = j \mid \omega), \quad \text{and} \quad f(d^* \mid q = j) = \arg\max_{i} \Pr(d^* = i \mid q = j).
\]
Instead of selecting the query that minimizes the expected posterior entropy over responses to d∗, the second step of DMVM selects the query q that minimizes M(d∗|q): the Mistake Volume associated with using q to match d∗.
To illustrate what mistake volume measures, consider the straightforward case where queries are binary (k = 2) and have deterministic responses when conditioned on ω, and without loss of generality denote the possible responses to either query as “yes” and “no”. Observing the response “yes” to q eliminates every ω outside the subset Ωq,“yes” ⊆ Ω, which contains the ω for which Pr(q = “yes”|ω) = 1.0, since all other ω are inconsistent with the observed response. Ωq,“yes” in turn splits into two subsets: one where the response to d∗ would be “yes”, and one where it would be “no”. The response corresponding to the subset with the higher probability mass is the best prediction of the response to d∗ after observing “yes” to q, and corresponds to f(d∗; q = “yes”) above. Similarly, there is a best prediction of the response to d∗ when the response to q is observed to be “no”, which corresponds to f(d∗; q = “no”).
Mistake volume, then, corresponds to the volume of the subset of Ω containing those ω for which the best prediction of the response to d∗, made after observing the response to q, does not match what the response to d∗ would actually be.
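The binary, deterministic illustration above can be sketched on a discretized model space. The following is a minimal sketch and not the implementation used in this dissertation. It assumes a one-dimensional Ω = [0, 1] split into equal cells, a uniform ψ, and threshold queries of the form "is ω > t?"; the function name `mistake_volume` and all numerical values are hypothetical. It follows the prose description: for each response to q, predict the response to d∗ with the larger probability mass, then measure the volume where that prediction is wrong.

```python
import numpy as np

def mistake_volume(q_resp, d_resp, psi, cell_volume):
    """Grid estimate of mistake volume for deterministic binary queries.

    q_resp, d_resp: boolean arrays with the response of q and of d* in each cell.
    psi: probability mass of each cell (used only to choose the predictions).
    cell_volume: volume of a single grid cell.
    """
    mistakes = np.zeros(q_resp.shape, dtype=bool)
    for j in (False, True):
        consistent = q_resp == j  # cells consistent with observing response j to q
        if not consistent.any():
            continue
        mass_yes = psi[consistent & d_resp].sum()
        mass_no = psi[consistent & ~d_resp].sum()
        prediction = mass_yes >= mass_no  # best prediction of d* given q = j
        # Cells where the prediction disagrees with the true response to d*.
        mistakes |= consistent & (d_resp != prediction)
    return mistakes.sum() * cell_volume

# Hypothetical setup: Omega = [0, 1], uniform psi, d* asks "is omega > 0.5?".
n = 1000
omega = (np.arange(n) + 0.5) / n   # cell midpoints
psi = np.full(n, 1.0 / n)
d_star = omega > 0.5

# Hypothetical askable queries: thresholds at 0.3, 0.45 and 0.9.
thresholds = [0.3, 0.45, 0.9]
volumes = {t: mistake_volume(omega > t, d_star, psi, 1.0 / n) for t in thresholds}
best = min(volumes, key=volumes.get)  # the query DMVM's second step would select
print(volumes, best)
```

In this toy setting the mistake volume of a threshold query is the length of the interval between its threshold and that of d∗, so the query whose threshold lies closest to 0.5 is selected.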
The computational efficiency of DMVM compared to DEER depends heavily on the nature of the askable query set Q, the functional form governing decision values Vωu, and the geometric structure of the model space Ω. In particular, DMVM needs to compute weighted volumes over Ω, which can be challenging or even infeasible to compute exactly in high-dimensional spaces (for example, computing the volume of a convex polytope defined by a set of linear inequalities is #P-hard (Dyer and Frieze, 1988)). While in general DMVM may not afford computational advantages over DEER, in Chapter 6 I will study the empirical performance of a particularly efficient algorithm whose query selection criterion is motivated by DMVM’s, in a setting similar to the illustration described above.
5.5.2.1 DMVM EVOI-loss
Next I derive a bound similar to the one stated in Theorem 5.5 that is a function of DMVM’s geometric similarity criterion for queries instead of DEER’s entropy-based similarity criterion. By applying Fano’s inequality (Fano and Wintringham, 1961), H(d∗|q) can be replaced with a quantity that is specific to a function or algorithm used to predict responses to queries in D∗k given a response to a query in Q. Let f(d∗; q = j) be the function that predicts the response to d∗ that has the highest probability conditioned on the response to q, i.e., f(d∗; q = j) = arg maxi Pr(d∗ = i|q = j; ψ). Then,
Lemma 5.8. For any d∗ ∈ D∗k and q ∈ Qk,
\[
H(d^* \mid q) \le \mathbb{E}_{j \sim \psi; q}\big[\Pr(d^* \neq f(d^*; q = j) \mid q = j; \psi)\big] \log(k) + 1.
\]
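Proof. Let b(d∗; q) denote the Bernoulli distribution where a success corresponds to the event that f(d∗; q = j) predicts the correct response to d∗, i.e., p = Pr(f(d∗; q = j) = d∗).
By Fano’s inequality,
\[
H(d^* \mid q) \le H\big(b(d^*; q)\big) + \mathbb{E}_{j \sim \psi; q}\big[\Pr(d^* \neq f(d^*; q = j) \mid q = j; \psi)\big] \log(k - 1),
\]
which in turn implies, since the entropy of a Bernoulli distribution is at most 1 (with logarithms in base 2) and log(k − 1) ≤ log(k), that
\[
H(d^* \mid q) \le \mathbb{E}_{j \sim \psi; q}\big[\Pr(d^* \neq f(d^*; q = j) \mid q = j; \psi)\big] \log(k) + 1.
\]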
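The bound in Lemma 5.8 can be checked numerically on small examples. The sketch below is illustrative only. It assumes logarithms in base 2 and represents the joint distribution over the responses to q and d∗ (with ψ already marginalized out) as a hypothetical k × k table `joint`, drawn at random. It compares the conditional entropy H(d∗|q) with the expected error probability of the maximum-probability predictor times log(k), plus 1.

```python
import numpy as np

def fano_check(joint):
    """Return (H(d*|q) in bits, Lemma 5.8 bound) for joint[j, i] = Pr(q = j, d* = i)."""
    k = joint.shape[1]
    p_q = joint.sum(axis=1)                    # marginal Pr(q = j)
    cond = joint / p_q[:, None]                # Pr(d* = i | q = j)
    mask = joint > 0
    h = -np.sum(joint[mask] * np.log2(cond[mask]))
    # f(d*; q = j) picks argmax_i Pr(d* = i | q = j), so the expected error
    # probability is 1 - sum_j max_i Pr(q = j, d* = i).
    p_err = 1.0 - joint.max(axis=1).sum()
    bound = p_err * np.log2(k) + 1.0
    return h, bound

rng = np.random.default_rng(0)
results = []
for k in (2, 3, 5, 8):
    # Hypothetical random joint distribution over (response to q, response to d*).
    joint = rng.dirichlet(np.ones(k * k)).reshape(k, k)
    results.append(fano_check(joint))

for h, bound in results:
    print(f"H(d*|q) = {h:.3f} bits, bound = {bound:.3f}")
```

The additive constant makes the bound loose when the predictor is nearly always correct, which is the price of dropping the Bernoulli entropy term.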
Next I further factor H(d∗|q) so that it becomes a function of the mistake volume associated with q and d∗, which leads directly to an EVOI-loss upper bound for DMVM. To do so, I need to introduce some additional definitions:
• Let the Response Function Jd∗(ω) for d∗ ∈ D∗k and ω ∈ Ω be defined as
\[
J_{d^*}(\omega) = \arg\max_{j} \Pr(d^* = j \mid \omega).
\]
Note that Jd∗(ω) is deterministic since decision query responses are deterministic conditioned on ω. I will utilize this fact below.
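• Let the Mistake Volume associated with using the response to q to predict the response to d∗ be M(d∗, q), the quantity written M(d∗|q) in the definition of DMVM at the start of this subsection.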
Note that mistake volume is the criterion that DMVM minimizes in order to project d∗ into the askable query set.
• Let Rz denote the set of all subsets of Ω such that for all R ∈ Rz,R
Combining this inequality with Lemma 5.8 yields
\[
H(d^* \mid q) \le p_{\Omega}\big(M(d^*, q); \psi\big) \log(k) + 1, \tag{5.4}
\]
which can be understood as an upper bound for H(d∗|q) in terms of two quantities: pΩ, which governs how concentrated the mass of ψ is in a geometric sense (and is independent of q and d∗), and the mistake volume associated with matching d∗ with q, which DMVM minimizes to project d∗ into the askable query set.
Analogous to how H(Q, D∗k) measures similarity between Q and D∗k in terms of posterior-response entropy, define M(Q, D∗k) as
\[
M(Q, D^*_k) = \min_{q \in Q} \max_{d^* \in D^*_k} M(d^*, q),
\]
which measures similarity between Q and D∗k in terms of mistake volume. Applying this definition yields Theorem 5.9 below:
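The order of the min and max in M(Q, D∗k) can be made concrete with a small table of pairwise mistake volumes. The sketch below is illustrative. The matrix entries are hypothetical numbers rather than values computed from any model: rows index askable queries q ∈ Q and columns index decision queries d∗ ∈ D∗k.

```python
import numpy as np

# Hypothetical pairwise mistake volumes: pairwise[q, d] = M(d*, q).
pairwise = np.array([
    [0.20, 0.35],   # askable query 0
    [0.05, 0.40],   # askable query 1
    [0.30, 0.10],   # askable query 2
])

worst_case = pairwise.max(axis=1)         # max over d* for each askable query
best_query = int(worst_case.argmin())     # askable query attaining the min
set_similarity = float(worst_case.min())  # M(Q, D*_k)
print(best_query, set_similarity)
```

Query 1 matches the first decision query most closely but is a poor match for the second, so the min-max criterion prefers query 2, whose worst-case mistake volume is smallest.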
Theorem 5.9. (Upper bound on DMVM EVOI-loss.) For any askable k-response query set Q, if DMVM selects q, then
max
The last inequality follows since DMVM selects q so as to minimize M(d∗, q) and pΩ is monotonically nondecreasing.
EVOI-loss guarantee for greedy approximation of DMVM. Recall that when DEER’s first step is approximated by greedy construction followed by recursive query improvement until convergence, the EVOI-loss of the query selected can be no more than (Vmax − Ṽmin) more than the EVOI-loss of the query selected by the exact version of DEER (Theorem 5.7).
When DMVM’s first step is approximated in the same manner, the same increase in maximum EVOI-loss applies, and the proof is analogous to that of Theorem 5.7.
Special Case: Gaussian Uncertainty. Here I show an example of how Theorem 5.9 can be specified for a simple uncertainty representation. In particular, let ψ take the form of a Gaussian distribution with mean µ and variance σ². Then,
\[
p_{\Omega}\big(M(Q, D^*_k); \psi\big) = \Pr\!\left(\mu - \frac{M(Q, D^*_k)}{2} \le \omega \le \mu + \frac{M(Q, D^*_k)}{2}\right) = \operatorname{erf}\!\left(\frac{M(Q, D^*_k)}{2\sqrt{2}\,\sigma}\right). \tag{5.5}
\]
The first equality above holds due to the combination of the facts that 1) the Gaussian pdf is symmetric about its mean; and 2) the Gaussian pdf is monotonically decreasing in absolute distance from the mean. The second equality holds because for any n,
\[
\Pr(\mu - n\sigma \le \omega \le \mu + n\sigma) = \operatorname{erf}\!\left(\frac{n}{\sqrt{2}}\right);
\]
the n in this case is obtained by setting M(Q, D∗k)/2 = nσ and solving for n. Plugging this result into Theorem 5.9 yields an EVOI-loss bound for DMVM that applies to the special case of single-dimensional Gaussian uncertainty:
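The Gaussian special case can be sanity-checked numerically. The sketch below is illustrative and rests on one stated assumption: that pΩ(z; ψ) is the mass ψ assigns to the interval of length z centered at the mean, as the symmetry and monotonicity argument suggests. Under that assumption, setting z/2 = nσ gives the closed form erf(z / (2√2 σ)). The function names and parameter values are hypothetical, and the closed form is compared against a direct midpoint-rule integration of the Gaussian pdf.

```python
import math

def p_omega_gaussian(volume, sigma):
    """Closed form: mass within volume/2 of the mean, i.e. erf(n / sqrt(2)) with n = volume / (2 sigma)."""
    return math.erf(volume / (2.0 * math.sqrt(2.0) * sigma))

def central_mass_numeric(volume, mu, sigma, steps=10_000):
    """Midpoint-rule integral of the Gaussian pdf over [mu - volume/2, mu + volume/2]."""
    lo = mu - volume / 2.0
    width = volume / steps
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        total += norm * math.exp(-0.5 * ((x - mu) / sigma) ** 2) * width
    return total

# Hypothetical values: mistake volume 2.0 and sigma 1.0, so n = 1.
closed = p_omega_gaussian(2.0, 1.0)
numeric = central_mass_numeric(2.0, mu=3.0, sigma=1.0)
print(closed, numeric)

# Trends: larger volume increases the mass, larger sigma decreases it.
by_volume = [p_omega_gaussian(v, 1.0) for v in (0.5, 1.0, 2.0, 4.0)]
by_sigma = [p_omega_gaussian(2.0, s) for s in (0.5, 1.0, 2.0, 4.0)]
```

With n = 1 the closed form reproduces the familiar one-standard-deviation mass of roughly 0.68, and the two lists show that the quantity grows with the volume and shrinks as σ grows.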
Lemma 5.10. Let Q be the askable query set, and let ψ take the form of a Gaussian distribution with variance σ². Then if DMVM selects q,
max q∗∈Q EVOI(q∗; ψ) − EVOI(q; ψ) ≤
Note that since erf(x) is monotonically increasing in x, Equation 5.5 implies that pΩ(M(Q, D∗k)) is monotonically increasing in M(Q, D∗k) and monotonically decreasing in σ. This implies that for Gaussian distributions, the error bound stated in Lemma 5.10 prescribes lower worst-case EVOI-loss as 1) D∗k and Q become more related in terms of maximum mistake volume; and 2) the variance of the agent’s uncertainty increases. Intuitively, this can be understood as follows: the more specific the region of Ω the agent needs to learn about, the stronger the required connection between D∗k and Q (in terms of maximum mistake volume) in order to achieve the same upper bound on EVOI-loss.