Maximum margin Markov networks (M3Ns) represent a different approach to solving the structural SVM quadratic program in OP 4 [104]. To achieve tractability, M3Ns restrict their attention to a significant subcase of structural learning: supervised learning over Markov networks. A Markov network consists of an undirected graph $G = (V, E)$, in which each node in $V$ corresponds to one of a set of random variables $X$ and an edge $\{u, v\} \in E$ represents a dependency between the variables $u$ and $v$, together with a collection of non-negative potential functions $\phi_k$, one for each clique $k$ in $G$. The joint distribution of the network is given as
$$
P(X = x) = \frac{1}{Z} \prod_{k \in \mathrm{cliques}(G)} \phi_k(x_{\{k\}})
$$
where $Z$ is the normalizing partition function, which ensures that the probabilities of all possible assignments to $X$ sum to 1:
$$
Z = \sum_{\hat{x}} \prod_{k \in \mathrm{cliques}(G)} \phi_k(\hat{x}_{\{k\}}) \tag{2.34}
$$
where $\hat{x}$ ranges over all possible assignments to $X$. Let us further suppose that, in log space, all potential functions $\phi_k$ take the form
$$
\log \phi_k(x_{\{k\}}) = \langle w, \psi(k, x_{\{k\}}) \rangle \tag{2.35}
$$
where $w$ is some weight vector shared among all the potential functions, and $\psi$ is a function taking two inputs: the clique $k$ and the values $x_{\{k\}}$ of the variables in that clique. Naturally, when one performs inference over this structure to assign values to variables given a network with potentials, one is interested in finding $\operatorname{argmax}_x P(X = x)$.
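As a concrete illustration of the log-linear Markov network described above, the following Python sketch enumerates every assignment of a tiny chain to compute the partition function and the most probable assignment. The three-node chain, the two-dimensional feature map `psi`, and the weight vector are hypothetical choices made only for this example, and brute-force enumeration is feasible only for very small networks.

```python
import itertools
import math

# Hypothetical toy network: a chain 0 - 1 - 2 over binary variables.
EDGES = [(0, 1), (1, 2)]
LABELS = [0, 1]
NUM_NODES = 3

def psi(edge, values):
    """Illustrative feature map: [values agree, both values equal 1]."""
    a, b = values
    return [1.0 if a == b else 0.0, 1.0 if a == b == 1 else 0.0]

def log_potential(w, edge, values):
    """log phi_k(x_{k}) = <w, psi(k, x_{k})>, as in (2.35)."""
    return sum(wi * fi for wi, fi in zip(w, psi(edge, values)))

def unnormalized(w, x):
    """Product of all clique potentials for a full assignment x."""
    return math.exp(sum(log_potential(w, (i, j), (x[i], x[j])) for i, j in EDGES))

def joint_distribution(w):
    """Enumerate all assignments to obtain Z (2.34) and P(X = x)."""
    assignments = list(itertools.product(LABELS, repeat=NUM_NODES))
    scores = {x: unnormalized(w, x) for x in assignments}
    z = sum(scores.values())
    return {x: s / z for x, s in scores.items()}, z

w = [1.0, 0.5]  # assumed weights, shared by all potentials
probs, z = joint_distribution(w)
best = max(probs, key=probs.get)  # argmax_x P(X = x)
print(best, probs[best])
```

Because $Z$ does not depend on $x$, the maximizing assignment could equally be found from the unnormalized scores alone.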
To give the canonical example, consider part-of-speech tagging with a standard sequence tagger. The nodes $V$ represent the words in a sentence, edges exist between adjacent words, and the variable assignments to $X$ represent the part of speech assigned to each word. In a typical implementation, $\psi(k, x_{\{k\}})$ selects the weights in $w$ relevant to the likelihood that the words in $k$ have the parts of speech indicated by $x_{\{k\}}$ and that these two parts of speech are adjacent.
In this formulation, the familiar input-output pairs $(x, y)$ take the following form: $x$ represents some structure from which one may induce a Markov network (e.g., the sequence of words in a sentence $x$ induces a chain Markov network of the same length), and $y$ represents the variable assignments in that network. Let us restrict our attention to pairwise Markov networks for now (i.e., all cliques are edges). Then, for an input pattern $x$ inducing a graph structure $G_x = (V_x, E_x)$, recall that the potential for the edge $\{i, j\} \in E_x$ with corresponding variable assignments $y_i, y_j$ is $\phi_{\{i,j\}} = \exp[\langle w, \psi(i, j, y_i, y_j) \rangle]$, with the overall distribution
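$$
P(y \mid x) = \frac{1}{Z} \exp\Big[ \sum_{\{i,j\} \in E_x} \langle w, \psi(i, j, y_i, y_j) \rangle \Big] = \frac{1}{Z} \exp\big[ \langle w, \Psi(x, y) \rangle \big]. \tag{2.36}
$$
In the language of the structural SVM, $\Psi(x, y) = \sum_{\{i,j\} \in E_x} \psi(i, j, y_i, y_j)$, with the log probability given as
$$
\log P(y \mid x) = -\log Z + \sum_{\{i,j\} \in E_x} \langle w, \psi(i, j, y_i, y_j) \rangle = -\log Z + \langle w, \Psi(x, y) \rangle, \tag{2.37}
$$
so our hypothesis, as in the case of the structural SVM, is of the form $h_w(x) = \operatorname{argmax}_y \langle w, \Psi(x, y) \rangle$.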
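For a chain-structured pairwise network, the argmax in the hypothesis can be computed by dynamic programming. The Python sketch below uses Viterbi search, which is a standard technique for chains and is not necessarily the procedure used in the cited work. The feature map `toy_psi`, the weights, and the label set are hypothetical and chosen only so that the result can be checked against brute-force enumeration.

```python
import itertools

def edge_score(w, psi, i, j, yi, yj):
    """<w, psi(i, j, y_i, y_j)> for a single edge."""
    return sum(wk * fk for wk, fk in zip(w, psi(i, j, yi, yj)))

def predict_chain(w, psi, n, labels):
    """Viterbi search for argmax_y <w, Psi(x, y)> on a chain with n >= 2 nodes."""
    best = {y: 0.0 for y in labels}  # best prefix score ending in each label
    backpointers = []
    for i in range(n - 1):
        new_best, pointers = {}, {}
        for yj in labels:
            scores = {yi: best[yi] + edge_score(w, psi, i, i + 1, yi, yj)
                      for yi in labels}
            pointers[yj] = max(scores, key=scores.get)
            new_best[yj] = scores[pointers[yj]]
        best = new_best
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    last = max(best, key=best.get)
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]

def predict_brute_force(w, psi, n, labels):
    """Reference implementation: score every labeling explicitly."""
    def total(y):
        return sum(edge_score(w, psi, i, i + 1, y[i], y[i + 1]) for i in range(n - 1))
    return max(itertools.product(labels, repeat=n), key=total)

def toy_psi(i, j, yi, yj):
    """Hypothetical features: [labels differ, y_j equals the parity of j]."""
    return [1.0 if yi != yj else 0.0, 1.0 if yj == j % 2 else 0.0]

w = [1.0, 2.0]  # assumed weights
y_hat, score = predict_chain(w, toy_psi, 4, [0, 1])
print(y_hat, score)
```

The dynamic program visits each edge once and compares every pair of labels, so its cost grows linearly in the chain length rather than exponentially.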
Unlike structural SVMs, M3Ns require a loss function $\Delta(y, \hat{y})$ that decomposes over the elements of $y$ and $\hat{y}$. Just as $\Psi$ is a sum of local feature functions $\psi$, for a given input pattern $x$, $\Delta$ becomes a sum of local losses $\delta$ over all vertices $i \in V_x$, with
$$
\Delta(y, \hat{y}) = \sum_{i \in V_x} \delta(i, y_i, \hat{y}_i) \tag{2.38}
$$
defined as the proportion of positions at which $y$ and $\hat{y}$ differ; that is, $\delta(i, y_i, \hat{y}_i) = \frac{1}{|V_x|} \mathbb{1}_{y_i \neq \hat{y}_i}$, where $\mathbb{1}_{\cdot}$ is the indicator function returning 1 or 0 if its input is true or false, respectively.
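The decomposition of the loss into per-vertex terms can be sketched in a few lines of Python. The function names and the example tag sequences are hypothetical; the computation is the normalized Hamming loss described above.

```python
def local_loss(i, y_i, y_hat_i, num_vertices):
    """delta(i, y_i, y_hat_i): 1/|V_x| if the labels at vertex i differ, else 0."""
    return 1.0 / num_vertices if y_i != y_hat_i else 0.0

def decomposable_loss(y, y_hat):
    """Delta(y, y_hat) as the sum of the local losses over all vertices (2.38)."""
    if len(y) != len(y_hat):
        raise ValueError("labelings must cover the same vertices")
    n = len(y)
    return sum(local_loss(i, a, b, n) for i, (a, b) in enumerate(zip(y, y_hat)))

# Hypothetical tag sequences that differ in two of four positions.
loss = decomposable_loss(["D", "N", "V", "N"], ["D", "V", "V", "D"])
print(loss)
```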
In the full structural SVM quadratic program, we have one dual variable $\alpha_x(y)$ for every wrong labeling $y$ of every example $x$. The work of [106] deals with this exponentially sized body of constraints by iteratively selecting and introducing the dual variables associated with the most violated constraint. In contrast, the work of [104] takes advantage of the special structure of the Markov network and reformulates the dual program with “marginal” dual variables $\mu_x(y_i) = \sum_{y \sim [y_i]} \alpha_x(y)$ and $\mu_x(y_i, y_j) = \sum_{y \sim [y_i, y_j]} \alpha_x(y)$. Here, $y \sim [y_i, y_j]$ denotes the set of all labelings $y$ with the variable assignments $y_i, y_j$ in positions $i, j$, respectively, and $y \sim [y_i]$ is defined analogously for a single position. Given our training set $S$, we can then pose an alternate dual quadratic program as follows:
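$$
\begin{aligned}
\max \quad & \sum_{(x_i, y_i) \in S} \sum_{u \in V_{x_i}} \sum_{y_u} \mu_{x_i}(y_u)\, \delta(u, y_{i,u}, y_u) \\
& - \frac{1}{2} \sum_{\substack{(x_i, y_i), \\ (x_j, y_j) \in S}} \; \sum_{\substack{(u,v) \in E_{x_i} \\ y_u, y_v}} \; \sum_{\substack{(r,s) \in E_{x_j} \\ y_r, y_s}} \mu_{x_i}(y_u, y_v)\, \mu_{x_j}(y_r, y_s)\, \langle \psi(u, v, y_u, y_v), \psi(r, s, y_r, y_s) \rangle \\
\text{s.t.} \quad & \sum_{y_u} \mu_x(y_u, y_v) = \mu_x(y_v), \qquad \sum_{y_u} \mu_x(y_u) = C, \qquad \mu_x(y_u, y_v) \geq 0.
\end{aligned}
$$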
In this formulation, the number of dual variables is polynomial in the length of the sequences and the number of possible local labelings. Moreover, when the Markov networks together form a forest, this formulation reaches the same solution as the original structural quadratic program.
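The relationship between the full dual variables and their marginals can be checked numerically on a small chain. In the Python sketch below, the chain length, the label set, the value of `C`, and the particular values of `alpha` are all hypothetical; in particular, `alpha` is assumed to be defined over every labeling and to sum to `C`. The sketch builds the node and edge marginals by summation and confirms that they satisfy the consistency constraints of the marginal program.

```python
import itertools

LABELS = [0, 1, 2]
N = 4                                    # hypothetical chain length
EDGES = [(i, i + 1) for i in range(N - 1)]
C = 1.0                                  # assumed regularization constant

# Hypothetical full dual variables alpha_x(y), one per labeling, summing to C.
labelings = list(itertools.product(LABELS, repeat=N))
raw = {y: 1.0 + sum(y) for y in labelings}
total = sum(raw.values())
alpha = {y: C * v / total for y, v in raw.items()}

def node_marginal(i, yi):
    """mu_x(y_i): sum of alpha_x(y) over labelings with label yi at position i."""
    return sum(a for y, a in alpha.items() if y[i] == yi)

def edge_marginal(i, j, yi, yj):
    """mu_x(y_i, y_j): sum of alpha_x(y) over labelings with yi, yj at positions i, j."""
    return sum(a for y, a in alpha.items() if y[i] == yi and y[j] == yj)

# Consistency: summing an edge marginal over y_u recovers the node marginal of y_v.
max_violation = 0.0
for (u, v) in EDGES:
    for yv in LABELS:
        summed = sum(edge_marginal(u, v, yu, yv) for yu in LABELS)
        max_violation = max(max_violation, abs(summed - node_marginal(v, yv)))

# Normalization: the node marginals at every position sum to C.
node_sums = [sum(node_marginal(i, y) for y in LABELS) for i in range(N)]

num_full = len(labelings)                                        # |labels| ** N
num_marginal = N * len(LABELS) + len(EDGES) * len(LABELS) ** 2   # polynomial count
print(num_full, num_marginal, max_violation)
```

For this toy chain the full program has 81 variables and the marginal program 39; the gap widens quickly, since the first count grows exponentially with the chain length and the second only linearly.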
When 3-cliques are present, one can introduce further marginal dual variables defined over these cliques, and when the graph contains loops, one can “triangularize” the dependency graph. Of course, triangularization and the subsequent introduction of 3-clique dual variables lead to a number of dual variables that is exponential in the size of both loops and cliques, but on certain classes of problems the loops and cliques are small enough that this is a reasonable suggestion. However, when the graphical model contains a larger clique, or a very large loop as is common in some applications, the number of variables required in the optimization problem can grow to the point where solving the problem becomes intractable.
The suggestion in this intractable case is simply to solve the QP with its pairwise marginal dual variables as a “relaxed” version of the full problem, i.e., to ignore any loops and focus only on enforcing local consistency. Though the theoretical guarantees of equivalence to OP 4 no longer hold, the authors empirically demonstrate the effectiveness of this method on the WebKB data [104]. In this problem, each node represents a web page, and each edge represents a link between two pages. The web pages form neither a tree nor a graph that can be tractably triangularized, so the collective classification of the web pages relies on this relaxation.
Closely related work employs a grid Markov random field to segment 3D scan data, with model parameters learned through a max-margin framework [4]. Starting from the original OP 4, the authors take the “family” of linear constraints, consisting of a single constraint for each possible wrong answer,
$$
\forall i, \forall y \in \mathcal{Y} \setminus y_i : \langle w, \Psi(x_i, y_i) \rangle \geq \langle w, \Psi(x_i, y) \rangle + \Delta(y_i, y) - \xi_i \tag{2.39}
$$
and reformulate it into the single non-linear constraint
$$
\forall i : \langle w, \Psi(x_i, y_i) \rangle + \xi_i \geq \max_{y \in \mathcal{Y} \setminus y_i} \big( \langle w, \Psi(x_i, y) \rangle + \Delta(y_i, y) \big) \tag{2.40}
$$
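That the family of linear constraints (2.39) and the single constraint (2.40) admit the same slack values can be verified by enumeration on a toy problem. In the Python sketch below, the joint feature map `toy_Psi`, the weights, and the input are hypothetical, and the slack is assumed to be constrained to be non-negative. The sketch computes the smallest feasible slack under each formulation and compares the two.

```python
import itertools

def dot(w, f):
    return sum(wk * fk for wk, fk in zip(w, f))

def toy_Psi(x, y):
    """Hypothetical joint features: [# positions where y matches x, # adjacent equal labels]."""
    matches = sum(1.0 for a, b in zip(x, y) if a == b)
    repeats = sum(1.0 for a, b in zip(y, y[1:]) if a == b)
    return [matches, repeats]

def hamming(y, y_hat):
    """Proportion of positions at which the two labelings differ."""
    return sum(1.0 for a, b in zip(y, y_hat) if a != b) / len(y)

def min_slack_family(w, x, y_true, label_space):
    """Smallest xi >= 0 satisfying one linear constraint per wrong labeling (2.39)."""
    xi = 0.0
    for y in label_space:
        if y == y_true:
            continue
        needed = dot(w, toy_Psi(x, y)) + hamming(y_true, y) - dot(w, toy_Psi(x, y_true))
        xi = max(xi, needed)
    return xi

def min_slack_max(w, x, y_true, label_space):
    """Smallest xi >= 0 satisfying the single max constraint (2.40)."""
    worst = max(dot(w, toy_Psi(x, y)) + hamming(y_true, y)
                for y in label_space if y != y_true)
    return max(0.0, worst - dot(w, toy_Psi(x, y_true)))

x = (1, 0, 1)        # hypothetical input pattern
y_true = (1, 0, 1)   # its correct labeling
label_space = list(itertools.product([0, 1], repeat=3))
w = [1.0, 0.5]       # assumed weights

xi_family = min_slack_family(w, x, y_true, label_space)
xi_max = min_slack_max(w, x, y_true, label_space)
print(xi_family, xi_max)
```

The enumeration here stands in for the inference procedure inside the max term; for realistic output spaces that search must be carried out by a structured inference method rather than by listing every labeling.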
This constraint contains the inference procedure in its maximization term. In this case, the maximization procedure for the Markov random field can be shown to be equivalent to an integer linear program, which is relaxed to a real-valued LP. By “folding” this LP back into the non-linear term of the constraint, and with some algebraic manipulation, the authors derive a modified quadratic program that implicitly contains the non-linear constraint. Of course, a real-valued relaxation of this max term produces a value greater than or equal to that of the original integer linear program, leading to a QP that is possibly “overconstrained” with respect to OP 4. Though used specifically for the scan segmentation problem setting, this “folding” strategy could be used in any structural learning problem with an inference mechanism that can be expressed as a linear program, in line with [103]. The resulting learning algorithm should be mathematically equivalent to our learning method for the special case where the separation oracle is computed through a linear program.
Some applications that utilize methods derived from M3N include sequence
tagging [104], image segmentation [4], alignment models for translation [63, 70], and general translation [69].
Related to M3Ns are maximum margin Bayesian networks [44]. Such methods, being based on directed models, must satisfy normalization constraints that M3Ns, being based on undirected Markov fields, need not obey; i.e., some of the probabilities must sum to 1. Though both parameter inference in training and inference with the trained models are approximate for general network topologies, these methods do show improved performance when the directedness of the model encodes valuable information.