Relation to Conditional Max-Entropy Models

5.2 Conditional Exponential Family Models

5.2.2 Relation to Conditional Max-Entropy Models

A max-entropy model is an element of a hypothesis class of probability density functions which satisfies a linear constraint on the first moment of the sample with the maximal entropy. The aim of such an estimation procedure is to objectively encode the information from the sample into the model. Jaynes (1957) showed that exponential family models are max-entropy models subject to a linear constraint on the first moment of the sample. This result was extended to conditional exponential family models and generalized to support different ‚divergence measures’ (Altun and Smola, 2006). We present here an instance of the latter work adapted to the maximization of conditional entropy subject to a constrain on the first moment of the sample. For that, let P denote the set of all conditional distributions that have square integrable densities with respect to some base measure µ defined on X × Y and support on the entire domain of a sufficient statistics φ(x,y). More formally, P ⊂ L2

µ(X × Y)

where L2

µ(X × Y) is the Hilbert space of square integrable functions defined on X × Y. Thus,

the inner product between f ,g ∈ P can be defined as hf ,gi_L2 µ =

X ×Yf (y| x)g (y | x)dµ. In

the remainder of the chapter, we will use supp(·) to denote the support of a probability distribution and µ_X to denote the marginal distribution of a measure µ defined on X × Y. Definition 5.1. The conditional entropy of a conditional density functionp∈ P with respect to a marginal probability density functionq defined on supp (µ_X) ⊆ X is defined as

H (p| q) = − Z

Xq (x)

Yp (y| x)lnp (y | x)dµ . (5.5)

Denote the empirical expectation of a feature vector φ(x,y) with respect to an inde- pendent sample z = {(xi, yi)}ni=1from the probability distribution ρ by φ =1n

Then, a max-entropy distribution from P satisfying a linear constraint on the first moment of the sample is a solution of the following optimization problem

argmax p∈P H (p| ρX) s.t. Eρ_XEp[φ (x,y) | x] − φ _H≤ ε , (5.6) where ε ≥ 0 is a hyperparameter chosen such that there exists a non-empty subset of P satisfying the linear constraint. The hyperparameter also controls the quality of the matching between the first moment of the sample and that of the conditional probability density function p. As the following proposition will show, if the optimal solution to this problem exists then it can be represented as a conditional exponential family model. If the solution exists for ε = 0, then it is sufficient to find the maximum likelihood estimator from the conditional exponential family. On the other hand, if the optimal solution exists for ε > 0 then it can be found by performing the maximum a posteriori estimation with a zero-mean Laplace prior on the parameter vector of the model. Before formally stating these results, we review the required terminology from convex analysis.

Definition 5.2. LetV be a Hilbert space. The convex conjugate of a function f : V → R is

f∗: V → R, where f∗_{is defined as}

f∗(ξ) = sup

v∈V hξ,viV− f (v) .

The effective domain of f is the set dom(f ) = {v ∈ V | f (v) < ∞}. The epigraph of f is the set of points above its graph, epi(f ) = {(v,r) ∈ V × R | f (v) ≤ r}. If the epigraph of a function is a convex/closed set (with respect to the Hilbert topology on V × R) then the function is called convex/closed. A convex function f is called proper if there exists v ∈ V such that f (v) < ∞. A proper convex function is closed if and only if it is lower-semicontinuous, i.e., liminf_v0_→vf (v0) ≥ f (v). The algebraic interior of a set A ⊆ V is defined as

core(A) = {a ∈ A | (∀v ∈ V)(∃tv > 0) :∀t ∈ [0,tv] a + tv ∈ A} .

If A is a convex set with non-empty interior then core(A) = int(A), where int(A) is the set of all interior points of A. Having reviewed the relevant terminology from convex analysis, we now state a variant of Fenchel’s duality theorem that relates a convex minimization problem to the concave maximization using conjugates. Fenchel’s theorem will then be used to relate the optimization problem in Eq. (5.6) to the maximum a posteriori estimation with a conditional exponential family model as the likelihood function. While the following variant of the theorem is for clarity reasons restricted to Hilbert spaces, the general result holds for Banach spaces (e.g., see Rockafellar, 1966).

Theorem 5.2. (Fenchel’s Duality Theorem, Rockafellar, 1966; Altun and Smola, 2006) Let

L : V1 → V2 be a bounded linear operator between Hilbert spacesV1 andV2. Suppose that

f :V1→ R and g : V2→ R are proper convex functions. Denote with

m = inf

v∈V1 f (v) + g (Lv) ∧ M = sup_ξ∈V₂ −f

∗_(L∗_ξ)_{− g}∗_{(−ξ) .}

Assume thatf , g, and L satisfy one of the following two conditions:

5.2 Conditional Exponential Family Models 133

ii) L dom (f )∩ cont(g) , ∅,

where cont(g) denotes the subset of the domain where g is continuous. Then, it holds that

m = M, where the dual solution M is attainable if it is finite.

Having reviewed Fenchel’s duality theorem, we are now ready to formally state a result relating the max-entropy estimation subject to empirical constrains on the first moments of the sample to the maximum a posteriori estimation using a conditional exponential family model as the likelihood function. The following proposition is a minor adaptation of a result by Altun and Smola (2006) that shows the equivalence of the two estimation problems. Theorem 5.3. Suppose there exists a constantr > 0 such that

φ (x, y )

_H< r for all (x, y)∈ X ×Y and let ε ≥ 0 be a hyperparameter such that there exists an optimal solution in the interior ofP for the optimization problem in Eq. (5.6). Then, we have

min p∈P −Hp | ρemp subject to EρempEp[φ (x,y) | x] − φ _H≤ ε = max θ∈H 1 n n X i=1 φ (x_i, y_i),θ − ln Z Yexp(φ (xi, y), θ) dν− ε kθkH (5.7) whereρemp= 1n Pn

i=1Ix=xi and Ix=xi is one whenx = xi and zero otherwise.

Proof. The proof follows along the lines of the work by Altun and Smola (2006) with the main difference being the use of conditional entropy instead of the Kullback–Leibler divergence. Beside this, the theorem defines the first moment constraints slightly differently compared to Altun and Smola (2006), where the expectation is not taken over a density function on X.

Denote with B = h ∈ H | khkH≤ 1 the unit ball centered at the origin of H and let

L_φ: L2

µ(X × Y) → H with Lφ(p) = EρXEp[φ (x,y) | x] be an operator mapping P to the Hilbert space H. This operator is linear because the expectation operator is linear and it is bounded because φ (x, y )

< r for all (x, y )∈ X × Y. For all ε ≥ 0 let

φ + εB =nh ∈ H | h = φ + εb ∧ kbkH≤ 1o .

Then, we have that nL_φ_{(p) ∈ H |} Lφ(p) − φ _H≤ ε ∧ p ∈ Po = nLφ(p) ∈ H | Lφ(p) ∈ φ + εB ∧ p ∈ Po . Now, in order to shift the hard constraint from the max-entropy optimization problem (see Eq. 5.6) into the objective let us introduce a characteristic function g : H → R such that

g (h) =        0 if h ∈ φ + εB ∞ otherwise .

On the one hand, functions f = −H (· | ρX)and g are proper closed convex functions (e.g., see

Boyd and Vandenberghe, 2004; Dudík and Schapire, 2006) and, thus, lower-semicontinuous. On the other hand, we have that dom(g) = φ + εB, where ε ≥ 0 is chosen so that 0 ∈ core_{dom(g) − L}_φ(dom(f ))

. Hence, the first condition from Theorem 5.2 is satisfied and we have that the primal and dual solutions are identical.

To complete the proof, we need to derive the convex conjugates for functions f and g. The convex conjugate of function g is given by

g∗(θ) = sup

ˆh nDθ, ˆhEH| ˆh ∈ φ + εBo = Dθ,φEH+ ε supkhkH≤1

Now, observe that for all h ∈ H and all p ∈ L2 µ(X × Y) we have that Dh,L_φpE H= Z Xρ (x) Z Y h, φ (x, y)_Hp (y| x)dµ =DL∗ φh, p E L2 µ , where L∗

φh = ρ (x) h, φ (x, y)Hand L∗φis the adjoint of the linear operator Lφ. Denote with

f (θ) = fL∗_φθ, where θ ∈ H and L∗_φθ∈ P . Then, the convex conjugate of ˆf is defined as ˆ f∗(θ) = sup p∈P Dθ,Lφp E H− Z Xρ (x) Z Yp (y| x)lnp (y | x)dµ .

To solve this linearly constrained optimization problem over the Hilbert space L2

µ(X × Y)

and compute ˆf∗_{in closed form, we first form the Lagrange function}

L(p,λ) = −Dθ,L_φpE H+ Z Xρ (x) Z Yp (y| x)lnp (y | x)dµ + Z Xλ (x) Z Y(p (y | x) − 1)dµ ,

where λ(x) ≥ 0 for all x ∈ supp(µX). As the optimal solution to the primal problem exists,

we can compute the functional gradient of this Lagrange functional and find its optimal solution by setting the gradient to zero. Before we proceed with this, let us introduce the notion of functional gradient in Hilbert spaces.

For a functional F defined on a Hilbert space and an element p from that space, the functional gradient ∇F (p) is the principal linear part of a change in F after it is perturbed in the direction of q, i.e.,

F (p + ηq) = F (p) + ηh∇F,qi +η2_,

where η → 0 (e.g., see Secton 3.2 in Gelfand and Fomin, 1963). Applying the definition of functional gradient to L we obtain that

∇pL = −ρ (x)

θ, φ (x, y)_H− lnp (y | x) − 1 + λ(x) . From here it then follows that

∇pL = 0 =⇒ p∗(y | x) =

exp

θ, φ (x, y)_H

exp(λ(x)_/_ρ(x)_{+ 1)} .

Now, taking λ(x) = ρ(x)lnR_Yexp

θ, φ (x, y0)_H

dν− 1

the constraint p∗_{∈ P is satisfied}

and the convex conjugate of ˆf is given by ˆ

f∗(θ) = EρX[A(θ | x)] .

Plugging the two convex conjugates into the dual problem from Theorem 5.2 and setting

ρ (x) = ρemp(x) we obtain M = max θ∈H 1 n n X i=1 φ (x_i, y_i),θ_H− A(θ | xi) − εkθkH .

5.2 Conditional Exponential Family Models 135 Theorem 5.3 guarantees that conditional exponential family models are objectively en- coding the information from the sample into the model. In fact, any other choice of the conditional model makes additional assumptions about the samples that reduce the entropy and introduces a potentially undesirable bias into the process. As already pointed out, the theorem shows that if a solution exists for ε = 0 then fitting of a max-entropy model is equivalent to the maximum likelihood estimation using conditional exponential family of models. Moreover, if a solution exists for ε > 0 then the max-entropy estimation is equivalent to the maximum a posteriori estimation with a conditional exponential family model as the likelihood function and the Laplace prior on the parameter vector. As the following corollary shows, the latter is equivalent to the maximum a posteriori estimation with the zero-mean Gaussian prior on the parameter vector.

Corollary 5.4. (Altun and Smola, 2007) Letε > 0 be a hyperparameter value such that there

exists a solution to the problem in Eq. (5.6). Then, there existsλ > 0 such that

argmax θ∈H 1 n n X i=1 φ (xi, yi),θ_H− A(θ | xi) − εkθkH = argmax θ∈H 1 n n X i=1 φ (xi, yi),θ_H− A(θ | xi) − λkθk2_H .

Corollary 5.4 allows us to relate conditional exponential family models and max-entropy estimation to Gaussian processes (Mackay, 1997; Rasmussen and Williams, 2005). The following section reviews this connection first observed by Altun et al. (2004).

In document Constructive Approximation and Learning by Greedy Algorithms (Page 147-151)