Conditional random fields (CRFs) share similarities with M3N learning, in that
both are intended for the class of problem where, given an input pattern x, one finds a labeling y of nodes in a probabilistic graphical model [64, 109].
Given an input x ∈ X and an output y ∈ Y, we have the following distribution, parameterized by w, for the probability of y given x. (Note that this form is the same as the M3N conditional distribution of (2.36).)
\[
P(y \mid x; w) = \exp\bigl[\langle w, \Psi(x, y)\rangle - z(w \mid x)\bigr] \tag{2.41}
\]
Note that Ψ retains a very similar meaning as in the M3N network, in that Ψ(x, y) decomposes as a sum over the cliques of the graphical model G for which we are finding the node labels y, so
\[
\Psi(x, y) = \sum_{k \in \mathrm{cliques}(G)} \psi(x, y_{\{k\}}) \tag{2.42}
\]
Here, y{k} represents the label configuration in y for the nodes in clique k. The value of the clique potential function for a clique k is therefore ⟨w, ψ(x, y{k})⟩.
Further, similar to the normalizing constant Z in the M3N conditional distribution, we have the log partition function
\[
z(w \mid x) = \log\left[\sum_{\hat{y} \in \mathcal{Y}} \exp\bigl[\langle w, \Psi(x, \hat{y})\rangle\bigr]\right] \tag{2.43}
\]
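To make (2.41) through (2.43) concrete, the following is a minimal sketch that evaluates the conditional distribution by brute-force enumeration on a tiny chain. The binary label set, the two-dimensional clique feature function, and the example values of w and x are all hypothetical choices made for illustration; they are not taken from the text.

```python
import itertools
import math

LABELS = (0, 1)  # hypothetical binary label set


def clique_features(x, y, k):
    """Toy psi(x, y_{k}) for the chain clique (k, k+1); an assumed feature design."""
    a, b = y[k], y[k + 1]
    return [x[k] * a + x[k + 1] * b,      # observation/label interaction
            1.0 if a == b else 0.0]       # label agreement along the edge


def joint_features(x, y):
    """Psi(x, y): sum of clique features over the cliques of the chain, as in (2.42)."""
    total = [0.0, 0.0]
    for k in range(len(y) - 1):
        total = [t + f for t, f in zip(total, clique_features(x, y, k))]
    return total


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_partition(w, x):
    """z(w|x) of (2.43), computed by enumerating every labeling (exponential cost)."""
    scores = [dot(w, joint_features(x, y))
              for y in itertools.product(LABELS, repeat=len(x))]
    m = max(scores)  # subtract the maximum for numerical stability
    return m + math.log(sum(math.exp(s - m) for s in scores))


def conditional_prob(w, x, y):
    """P(y|x; w) of (2.41)."""
    return math.exp(dot(w, joint_features(x, y)) - log_partition(w, x))


W_EXAMPLE = [0.3, -0.7]        # hypothetical parameters
X_EXAMPLE = (0.5, -1.0, 2.0)   # hypothetical input with three nodes
```

Because z(w|x) is the log of the sum of the unnormalized scores, the probabilities returned by `conditional_prob` sum to one over all labelings of `X_EXAMPLE`.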
With this form for P(y|x; w), we may write the conditional likelihood of the entire training sample S = ((x1, y1), (x2, y2), . . . , (xn, yn)), with x[S] = x1, x2, . . . , xn and y[S] = y1, y2, . . . , yn, as
\[
P(y[S] \mid x[S]; w) = \prod_{i=1}^{n} P(y_i \mid x_i; w) = \exp\left[\sum_{i=1}^{n} \bigl(\langle w, \Psi(x_i, y_i)\rangle - z(w \mid x_i)\bigr)\right] \tag{2.44}
\]
Where a CRF differs substantially from the M3N method is in its training objective: instead of learning the parameters w with the aim of maximizing the margin, the goal is to find the most likely parameterization w of the model given the training set S, as measured by the posterior:
\[
P(w \mid x[S], y[S]) \propto P(w)\, P(y[S] \mid x[S]; w) \tag{2.45}
\]
For the prior distribution over the parameters, a zero-mean Gaussian is chosen, P(w) ∝ exp(−‖w‖² / (2σ²)). The goal in training is to find the most likely parameterization w∗ given the training sample S (i.e., the mode of the posterior over the parameters), specifically:
\[
w^{*} = \arg\max_{w} P(w \mid x[S], y[S]) \tag{2.46}
\]
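How can we calculate this? Note that according to Bayes' rule, the posterior is proportional to the product of the prior P(w) and the conditional likelihood P(y[S]|x[S]; w), as in (2.45).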
Let L(w) be the negative log-posterior of the parameters w, specifically:
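\begin{align}
L(w) &= -\log P(w \mid x[S], y[S]) + \text{(some constant)} \tag{2.48} \\
     &= \frac{\|w\|^{2}}{2\sigma^{2}} - \sum_{i=1}^{n} \bigl[\langle w, \Psi(x_i, y_i)\rangle - z(w \mid x_i)\bigr] \tag{2.49}
\end{align}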
As L(w) is the negative log of the posterior, we can maximize this posterior by finding w that minimizes L(w).
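The step from (2.48) to (2.49) can be spelled out as a short worked derivation. This is a sketch that uses only the Gaussian prior and the conditional likelihood (2.44) stated above; the symbols C and C' are introduced here purely for illustration and stand for terms that do not depend on w.

```latex
\begin{align*}
L(w) &= -\log P(w \mid x[S], y[S]) + C \\
     &= -\log P(w) - \log P(y[S] \mid x[S]; w) + C'
        && \text{(Bayes' rule; the evidence term is absorbed into } C'\text{)} \\
     &= \frac{\|w\|^{2}}{2\sigma^{2}}
        - \sum_{i=1}^{n} \bigl[\langle w, \Psi(x_i, y_i)\rangle - z(w \mid x_i)\bigr]
        && \text{(Gaussian prior and (2.44), with } C' \text{ chosen to cancel the remaining constants)}
\end{align*}
```

Since the dropped terms are independent of w, they do not affect the location of the minimizer of L(w).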
The method of minimization employed in CRF training is typically some form of gradient descent on L. The gradient is given as
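\[
\frac{\partial}{\partial w} L(w) = \frac{w}{\sigma^{2}} - \sum_{i=1}^{n} \Bigl[ \Psi(x_i, y_i) - \overbrace{\sum_{y \in \mathcal{Y}} P(y \mid x_i; w)\, \Psi(x_i, y)}^{E} \Bigr] \tag{2.50}
\]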
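As an illustrative check of the gradient (2.50), the sketch below evaluates L(w) and its gradient by brute-force enumeration on a toy chain model and compares the result against central finite differences. The feature map, the two training pairs, and the values of w and σ² are hypothetical; the enumeration is exponential in the sequence length and is only meant for very small examples.

```python
import itertools
import math

LABELS = (0, 1)  # hypothetical binary label set


def joint_features(x, y):
    """Toy Psi(x, y) for a chain; an assumed feature design, not from the text."""
    emit = sum(xi for xi, yi in zip(x, y) if yi == 1)      # observation/label term
    agree = sum(1.0 for a, b in zip(y, y[1:]) if a == b)   # neighbor agreement term
    return [emit, agree]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def all_labelings(x):
    return itertools.product(LABELS, repeat=len(x))


def log_partition(w, x):
    scores = [dot(w, joint_features(x, y)) for y in all_labelings(x)]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


def neg_log_posterior(w, data, sigma2):
    """L(w) of (2.49), up to an additive constant."""
    value = dot(w, w) / (2.0 * sigma2)
    for x, y in data:
        value -= dot(w, joint_features(x, y)) - log_partition(w, x)
    return value


def expected_features(w, x):
    """The term E in (2.50): expectation of Psi(x, y) under P(y|x; w)."""
    z = log_partition(w, x)
    expectation = [0.0] * len(w)
    for y in all_labelings(x):
        f = joint_features(x, y)
        p = math.exp(dot(w, f) - z)
        expectation = [e + p * fj for e, fj in zip(expectation, f)]
    return expectation


def gradient(w, data, sigma2):
    """Gradient of L(w) as in (2.50)."""
    g = [wj / sigma2 for wj in w]
    for x, y in data:
        f = joint_features(x, y)
        e = expected_features(w, x)
        g = [gj - (fj - ej) for gj, fj, ej in zip(g, f, e)]
    return g


def numeric_gradient(w, data, sigma2, h=1e-5):
    """Central finite differences, used only to sanity-check `gradient`."""
    grad = []
    for j in range(len(w)):
        up, down = list(w), list(w)
        up[j] += h
        down[j] -= h
        diff = neg_log_posterior(up, data, sigma2) - neg_log_posterior(down, data, sigma2)
        grad.append(diff / (2.0 * h))
    return grad


DATA = [((0.5, -1.0, 2.0), (1, 0, 1)), ((1.5, 0.2), (1, 1))]  # hypothetical sample
W = [0.3, -0.7]
SIGMA2 = 2.0
```

Comparing `gradient(W, DATA, SIGMA2)` with `numeric_gradient(W, DATA, SIGMA2)` gives a quick way to confirm that the analytic expression and the objective agree before replacing the enumeration with a more efficient inference routine.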
The interesting portion of computing the gradient at each step is the term labeled E, the expectation of Ψ(xi, y) under the model distribution P(y|xi; w). This requirement of a marginal over all possible outputs is a weakness of CRFs. In the case of graphical models, we have the sum-product belief propagation algorithm to compute this marginal, but in other applications, computing a function over all possible outputs may be intractable, or may add complexity to the learning procedure, as it requires another algorithm aside from the inference step.
In the case of graphical models, the sum Σ_{y∈Y} P(y|xi; w) Ψ(xi, y) in (2.50) may be computed in time exponential in the size of the largest clique in the optimally triangulated version of the underlying graphical model G. Chains and trees have a maximal clique size of 2, but in cases where G has large cliques or loops, exact computation of the E term will no longer be tractable. For example, when G takes the form of a grid or lattice (as is common in image processing applications), exact computation of E for the gradient is no longer feasible.
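For the tractable chain case, the following sketch contrasts brute-force enumeration with a forward (sum-product) recursion for the log partition function. It assumes the clique potentials have already been evaluated into per-node and per-edge score tables; the table layout and the numeric values are hypothetical.

```python
import itertools
import math


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(node, edge, y):
    """Total score of labeling y: node[t][a] and edge[a][b] are assumed to hold
    precomputed potentials for label a at position t and for transition a -> b."""
    score = sum(node[t][y[t]] for t in range(len(y)))
    score += sum(edge[y[t]][y[t + 1]] for t in range(len(y) - 1))
    return score


def log_partition_brute(node, edge):
    """Enumerates all |labels|^n labelings; exponential in the chain length n."""
    n, num_labels = len(node), len(edge)
    return logsumexp([sequence_score(node, edge, y)
                      for y in itertools.product(range(num_labels), repeat=n)])


def log_partition_forward(node, edge):
    """Forward recursion in log space; cost grows as n * |labels|^2."""
    num_labels = len(edge)
    alpha = list(node[0])
    for t in range(1, len(node)):
        alpha = [node[t][b] + logsumexp([alpha[a] + edge[a][b]
                                         for a in range(num_labels)])
                 for b in range(num_labels)]
    return logsumexp(alpha)


NODE = [[0.2, -0.5, 1.0], [0.0, 0.3, -1.2], [1.1, 0.4, 0.0], [-0.3, 0.8, 0.5]]
EDGE = [[0.5, -0.2, 0.0], [0.1, 0.7, -0.4], [-0.6, 0.0, 0.9]]
```

Both functions compute the same quantity on `NODE` and `EDGE`; only the forward recursion remains practical as the chain grows, which is the sense in which chains and trees admit exact computation.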
When G is a general graphical model rather than a chain or tree, one typically employs some form of approximation in computing this gradient, leading to approximate training of model parameters. In [109], a stochastic gradient descent method is employed which makes use of approximations of the gradient. [47] also uses gradient descent, utilizing contrastive divergence [48] to approximate the gradient in computing the step at each iteration. Bayesian CRFs, a method closely related to CRFs, utilize an approximation of the posterior of the model parameters in training [85]. Discriminative random fields, another method closely related to CRFs, use pseudolikelihood to estimate model parameters [61, 62].
Another interesting observation relating to conditional random fields is that a learning method based on approximate inference may, in some cases, do better than a CRF model built for exact inference, with a locally trained model giving better sequence predictions. In particular, a CRF that is trained in a piecewise fashion in some cases appears to perform better than a globally trained CRF [101]. The ability of a locally and, in some sense, “inexactly” trained sequence model to perform comparably to globally trained models was a feature in [84, 92] as well.
In particular, [92] addresses the use of CRFs for sequence prediction in the case where one has constraints on the output that are known a priori. For example, consider a simple semantic role labeling task, where one has a sentence and wishes to discover the verb-argument structure, where each “verb” has a single argument, and each argument is itself one of several types. One can then have constraints that are difficult or impossible to include in standard Viterbi: for example, one would want exactly one argument label, the active verb is provided as input, various verbs disallow certain types of arguments, etc. The suggestion is to phrase the Viterbi sequence inference procedure instead as an instance of an integer linear program (ILP). The flexibility of the ILP formulation allows them to include a more general class of constraints than can be accommodated by a Viterbi-like algorithm.
The paper is interesting and relevant to this work in two respects. First, the constrained inference procedure is not used in training, leading to a machine learning procedure which is, in some respect, “relaxed,” as the inference mechanism used in evaluation differs from the inference mechanism for which the training algorithm is trying to optimize. Instead of training for a constrained sequence predictor, they train the model as a vanilla CRF for an unconstrained sequence predictor. Second, going even further, they utilize a training method which does not learn a sequence model at all, i.e., it effectively just learns a multiclass classifier. Performance of the purely locally trained model without the ILP constraints is quite low, though with the inclusion of constraints the performance rises rapidly, even to the point of exceeding the performance of the “properly” globally trained model once all constraints are active.
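The effect of imposing an a priori constraint at inference time on a locally trained model can be illustrated with a small sketch. It is not the ILP formulation of [92]: exhaustive search over labelings is swapped in for the ILP solver, the two-label set is a simplification, and the per-token scores are invented for illustration. Only the "exactly one argument label" constraint is taken from the description above.

```python
import itertools

LABELS = ("O", "ARG")  # hypothetical label set: non-argument vs. argument


def local_argmax(scores):
    """Unconstrained prediction: each token independently takes its best label,
    as a purely local multiclass classifier would."""
    return tuple(max(LABELS, key=lambda lab: s[lab]) for s in scores)


def satisfies_constraints(labels):
    """A priori constraint: exactly one argument label in the sequence."""
    return labels.count("ARG") == 1


def constrained_argmax(scores):
    """Best-scoring labeling among those satisfying the constraints.
    Exhaustive search stands in for an ILP solver; viable only for tiny inputs."""
    best, best_score = None, float("-inf")
    for labels in itertools.product(LABELS, repeat=len(scores)):
        if not satisfies_constraints(labels):
            continue
        total = sum(s[lab] for s, lab in zip(scores, labels))
        if total > best_score:
            best, best_score = labels, total
    return best


# Hypothetical per-token scores from a locally trained classifier.
SCORES = [
    {"O": 0.2, "ARG": 0.9},
    {"O": 0.1, "ARG": 0.7},
    {"O": 1.0, "ARG": 0.0},
    {"O": 0.6, "ARG": 0.5},
]
```

On `SCORES`, the local prediction labels the first two tokens as arguments and so violates the constraint, whereas the constrained search keeps only the higher-scoring of the two. An actual ILP encoding would express the same constraint as a linear equality over binary indicator variables.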