Markov Networks and Conditional Random Fields

Markov random fields (MRFs) are a type of probabilistic graphical model and the undirected counterpart of Bayesian networks. In contrast to the latter, the edges between variables are undirected, which is typically more natural for problems in image analysis and additionally allows one to model cyclic dependencies. MRFs, in their general form, model the joint distribution $P(Y)$ of some set of variables $Y = (Y_1, \dots, Y_M)$, which typically represent classes of a classification problem. This distribution can be represented as a product of potential functions $\psi(\cdot)$ acting on subsets $\xi_k(Y)$ of variables, where $\xi_k : Y \mapsto Y_k$ and $Y_k \subseteq Y$:

$$P(Y \mid \Theta) = \frac{1}{Z(\Theta)} \prod_{k=1}^{K} \psi_k(\xi_k(Y)) \tag{7.1}$$

where $Z(\Theta) = \sum_{Y} \tilde{P}(Y; \Theta)$ is a normalization constant known as the partition function, with $\tilde{P}(Y; \Theta)$ the unnormalized product of potentials. Two types of potentials are often employed for image analysis problems: unary or singleton potentials acting on a single variable, and binary or interaction potentials that capture co-occurrence statistics between variables.
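A minimal sketch of equation (7.1) on a toy model may help make the partition function concrete. It assumes three binary variables in a chain with hand-picked potential values; the variable count, the potential values, and all function names are illustrative assumptions and do not come from the text.

```python
import itertools

# Hypothetical toy model: three binary variables Y1 - Y2 - Y3 in a chain.
NUM_VARS = 3
LABELS = (0, 1)

def unary(y):
    # Singleton potential: mild preference for label 1 (illustrative values).
    return 2.0 if y == 1 else 1.0

def pairwise(yi, yj):
    # Interaction potential: neighbouring variables prefer equal labels.
    return 3.0 if yi == yj else 1.0

def unnormalized(y):
    # Product of all potentials, i.e. the unnormalized score in equation (7.1).
    score = 1.0
    for yi in y:
        score *= unary(yi)
    for yi, yj in zip(y, y[1:]):
        score *= pairwise(yi, yj)
    return score

def partition_function():
    # Z sums the unnormalized score over every joint assignment.
    return sum(unnormalized(y) for y in itertools.product(LABELS, repeat=NUM_VARS))

def joint(y):
    # Normalized joint probability of one assignment.
    return unnormalized(y) / partition_function()
```

For these particular values the eight assignments sum to Z = 123, and the all-ones labelling receives probability 72/123. The enumeration visits 2^M assignments, which is only workable for very small M.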

Without loss of generality, the potentials can be represented as an exponentiated linear combination of feature functions $\phi_k(\cdot)$ and model parameters $\theta_k^T = (\theta_1, \dots, \theta_M)$:

$$\psi_k(\xi_k(Y)) = \exp\{\theta_k^T \phi_k(\xi_k(Y))\}$$

resulting in a model that can be seen as a structured extension of logistic regression: instead of a distribution over a single output variable, a joint distribution over a set of variables is learned. The feature functions are often binary mappings but can take any form. Typical functions for segmentation problems in vision are designed to enforce consistency among neighboring pixels, such as $\phi(y_i, y_j) = \mathbb{1}\{y_i = y_j\}$, with $\mathbb{1}\{\cdot\}$ the binary indicator function and the $y$ variables representing pixels in the image.

CHAPTER 7. AN INTEGRATIVE PROBABILISTIC FRAMEWORK FOR COMPUTER AIDED DETECTION OF BREAST CANCER IN MAMMOGRAPHY
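As a small worked instance of the log-linear form with the indicator feature, consider two binary pixels joined by a single interaction potential with one scalar weight $\theta$ and no unary terms (this two-pixel setup is an illustrative assumption, not a model from the text). The probability that the two pixels agree then reduces to a logistic function of $\theta$, which illustrates the link to logistic regression:

```latex
\begin{align*}
\psi(y_i, y_j) &= \exp\{\theta\,\mathbb{1}\{y_i = y_j\}\}, \qquad y_i, y_j \in \{0, 1\},\\
Z(\theta) &= \sum_{y_i, y_j} \psi(y_i, y_j) = 2e^{\theta} + 2,\\
P(y_i = y_j \mid \theta) &= \frac{2e^{\theta}}{2e^{\theta} + 2} = \frac{1}{1 + e^{-\theta}}.
\end{align*}
```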

The CRF model [161, 258], also sometimes referred to as a discriminative random field [159], is a specific type of MRF that assumes every variable $Y_k$ in the model is conditioned on an input $X_k$. The main advantage in this setting is that parameters in potentials can be trained discriminatively using models like deep CNNs, in which case $X_k$ is an input patch. This is advantageous if the underlying generative model is complex but the class posterior is relatively simple [159], as in the case of images. Although similar, two computational problems are typically distinguished and are relevant for our application: inference and learning.

7.2.1 Inference

Inference algorithms are divided into sampling-based methods that use Monte Carlo techniques to approximate the true posterior and variational methods that give an exact solution to a tractable surrogate of the true distribution. Both directed and undirected models can be represented in the form of a factor graph [157]: a bipartite graph comprising variable and factor nodes, which expedites the generalization of inference algorithms. A common inference problem is computing marginals: given a joint distribution $P(Y)$ over a set of random variables $Y = \{Y_m\}_{m=1}^{M}$, compute the distribution $P(Y_m) = \sum_{Y_{\neg m}} P(Y)$ over an individual variable $Y_m$. These values are needed in our model to eventually generate image-based labels. Marginals can simply be computed by summing out all other variables in the distribution. However, the time complexity of this operation is exponential in the number of variables in the graph, so it is infeasible in practice for all but the smallest models.
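The brute-force marginalization described above can be sketched as follows. The joint distribution here is a randomly generated table over four binary variables, used purely as a stand-in for a real model; the names `joint`, `marginal`, and `num_terms` are hypothetical.

```python
import itertools
import random

M = 4  # number of binary variables (illustrative)
random.seed(0)

# Hypothetical joint distribution stored as a dict from assignments to probabilities.
assignments = list(itertools.product((0, 1), repeat=M))
weights = [random.random() for _ in assignments]
total = sum(weights)
joint = {y: w / total for y, w in zip(assignments, weights)}

def marginal(joint, m):
    # P(Y_m = v) is the sum of P(Y) over all assignments with Y_m = v.
    result = {0: 0.0, 1: 0.0}
    for y, p in joint.items():
        result[y[m]] += p
    return result

marginals = [marginal(joint, m) for m in range(M)]
num_terms = len(joint)  # 2**M terms are visited for every marginal
```

Each additional binary variable doubles `num_terms`, which is the exponential growth that motivates the message-passing approach.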

Belief propagation [203, 292] is a type of variational inference introduced to efficiently compute marginals [292]; it reduces the complexity of the computation from exponential to linear in the number of variables in the graph. It is phrased as a recursive algorithm that sends messages between nodes in the graph about instantiations $y_m$ of a variable $Y_m$. In the case of a factor graph, two types of operations are performed: (1) a message from a variable $m$ to a factor $k$:

$$\mu_{m \to k}(y_m) = \prod_{k' \in N(m)_{\neg k}} \mu_{k' \to m}(y_m) \tag{7.2}$$

where $N(m)_{\neg k}$ generates the set of all factors containing variable $m$, excluding $k$,

and (2) a message from a factor $k$ to a variable $m$:

$$\mu_{k \to m}(y_m) = \sum_{y \in \xi_k(Y)_{\neg Y_m}} \psi_k(\xi_k(Y)) \prod_{m' \in N(k)_{\neg m}} \mu_{m' \to k}(y_{m'}) \tag{7.3}$$

where $N(k)_{\neg m}$ is again a neighborhood-generating function, this time returning all variables in the neighborhood of factor $k$, excluding $m$. This algorithm outputs refined scores for each variable in the model that take into account any co-occurrence relations and all factors in the model.
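The two message types can be sketched with the standard sum-product updates on a small tree-structured factor graph. Everything specific here is an assumption made for illustration: the three-variable chain, the factor tables, and the function names. Messages are recomputed recursively without caching to keep the sketch short; a practical implementation would store each message once, which is what yields the linear cost.

```python
import itertools
import math

LABELS = (0, 1)
NUM_VARS = 3

# Hypothetical tree-structured factor graph: name -> (scope, potential function).
FACTORS = {
    "u0": ((0,), lambda y: 2.0 if y[0] == 1 else 1.0),
    "u2": ((2,), lambda y: 1.0 if y[0] == 1 else 3.0),
    "p01": ((0, 1), lambda y: 3.0 if y[0] == y[1] else 1.0),
    "p12": ((1, 2), lambda y: 3.0 if y[0] == y[1] else 1.0),
}

def factors_of(m):
    # N(m): all factors whose scope contains variable m.
    return [k for k, (scope, _) in FACTORS.items() if m in scope]

def msg_var_to_factor(m, k, ym):
    # Variable-to-factor message: product of messages from all other factors of m.
    return math.prod(msg_factor_to_var(k2, m, ym) for k2 in factors_of(m) if k2 != k)

def msg_factor_to_var(k, m, ym):
    # Factor-to-variable message: sum out the other variables in the factor's scope.
    scope, psi = FACTORS[k]
    others = [v for v in scope if v != m]
    total = 0.0
    for vals in itertools.product(LABELS, repeat=len(others)):
        assign = dict(zip(others, vals))
        assign[m] = ym
        term = psi(tuple(assign[v] for v in scope))
        for v in others:
            term *= msg_var_to_factor(v, k, assign[v])
        total += term
    return total

def bp_marginal(m):
    # Belief at variable m: normalized product of all incoming factor messages.
    belief = [math.prod(msg_factor_to_var(k, m, ym) for k in factors_of(m)) for ym in LABELS]
    z = sum(belief)
    return [b / z for b in belief]

def brute_force_marginal(m):
    # Reference marginal by full enumeration, feasible only for tiny graphs.
    belief = [0.0 for _ in LABELS]
    for y in itertools.product(LABELS, repeat=NUM_VARS):
        score = math.prod(psi(tuple(y[v] for v in scope)) for scope, psi in FACTORS.values())
        belief[y[m]] += score
    z = sum(belief)
    return [b / z for b in belief]
```

On a tree such as this chain, `bp_marginal(m)` and `brute_force_marginal(m)` agree up to floating-point error. The recursion relies on the graph having no cycles; on a graph with loops it would not terminate as written.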

7.2.2 Learning

Maximum Likelihood Estimation (MLE) is the most commonly used technique to train probabilistic graphical models (PGMs). In the fully observed case, the log-likelihood of parameters $\Theta$ given a dataset $D = \{X_n, Y_n\}_{n=1}^{N}$ under a CRF is given by:

$$\log[L(\Theta; D)] = \frac{1}{N} \sum_{n=1}^{N} \left[ \sum_{k=1}^{K} \log \psi_k(\xi_k(Y_n)) - \log Z(\Theta; X_n) \right] \tag{7.4}$$


where samples are assumed to be i.i.d. Taking partial derivatives with respect to the parameters in the model results in the difference between what are referred to as the clamped and contrastive terms:

$$\overbrace{\frac{1}{N} \sum_{n=1}^{N} \phi_k(\xi_k(Y_n); \theta_k)}^{\text{clamped term}} - \overbrace{\sum_{Y} p(Y \mid X; \Theta)\, \phi_k(\xi_k(Y); \theta_k)}^{\text{contrastive term}} \tag{7.5}$$

Since $\frac{1}{N} \sum_{n=1}^{N} \phi_k(\xi_k(Y_n)) = E_D[\phi_k(\xi_k(Y_n))]$ is the expectation of the feature in the data and $\sum_{Y} p(Y \mid X; \Theta)\, \phi_k(\xi_k(Y)) = E_\Theta[\phi_k(\xi_k(Y))]$ is the expectation under the model, this process is also referred to as moment matching. The CRF loss function is convex but has no closed-form solution, and hence iterative methods, in particular variations on gradient descent, are applied to find the optimal set of parameters. Computing the contrastive term in equation (7.5) is exponential in the number of variables $M$ in the graph and, due to the dependence on the input in the CRF formulation, needs to be repeated for every training step, rendering learning slow or intractable for large graphs with many edges. Several approximate learning methods [200] have been proposed.
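The clamped-minus-contrastive gradient and the resulting moment matching can be sketched on a toy problem. To keep it short, this sketch makes several assumptions that are not in the text: it drops the input X (so it is an unconditional MRF rather than a CRF), uses a three-variable binary chain with two hand-made features, computes the contrastive term by brute-force enumeration, and trains on made-up labellings with plain gradient ascent.

```python
import itertools
import math

LABELS = (0, 1)
NUM_VARS = 3

def features(y):
    # Two hypothetical features: number of agreeing neighbours and number of 1-labels.
    agree = sum(1.0 for a, b in zip(y, y[1:]) if a == b)
    ones = float(sum(y))
    return (agree, ones)

def model_expectation(theta):
    # Contrastive term: expected features under the model, by full enumeration.
    states = list(itertools.product(LABELS, repeat=NUM_VARS))
    scores = [math.exp(sum(t * f for t, f in zip(theta, features(y)))) for y in states]
    z = sum(scores)
    return tuple(
        sum(s / z * features(y)[i] for s, y in zip(scores, states))
        for i in range(len(theta))
    )

def data_expectation(data):
    # Clamped term: average features over the observed labellings.
    num_features = len(features(data[0]))
    return tuple(sum(features(y)[i] for y in data) / len(data) for i in range(num_features))

def fit(data, steps=5000, lr=0.2):
    # Gradient ascent on the average log-likelihood (step count and rate are arbitrary).
    theta = [0.0, 0.0]
    clamped = data_expectation(data)
    for _ in range(steps):
        contrastive = model_expectation(theta)
        # Gradient is the clamped term minus the contrastive term.
        theta = [t + lr * (c - m) for t, c, m in zip(theta, clamped, contrastive)]
    return theta

# Made-up training labellings.
DATA = [(0, 0, 0), (1, 1, 1), (1, 1, 0), (0, 1, 1), (1, 1, 1), (0, 1, 0)]
theta_hat = fit(DATA)
```

After training, `model_expectation(theta_hat)` should be close to `data_expectation(DATA)`, which is the moment-matching condition. The call to `model_expectation` inside the loop is the expensive part: it enumerates every joint state at every step.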

7.2.3 Approximate learning

Popular approximate learning methods include Pseudo-Likelihood (PL) [18], Contrastive Divergence (CD) [118, 34, 287] and piecewise training [259]. Pseudo-likelihood reduces the complexity to polynomial by assuming that all variables are observed during training. The likelihood is estimated by conditioning each variable on its observed neighbors and subsequently combining the resulting conditionals:

$$P(Y \mid X; \Theta) \approx \prod_{k=1}^{M} P_{PL}(Y_k \mid Y_{\neg k}, X; \Theta) \tag{7.6}$$

Since each normalization constant now depends on one variable only, the complexity reduces from exponential to linear in the number of variables in the graph.
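A minimal sketch of the pseudo-likelihood idea, under stated assumptions: a binary chain with a hypothetical two-parameter log-linear score, no input X, and arbitrary parameter values. Each conditional is normalized over the labels of a single variable, and an exact conditional computed from the full joint is included only as a check on small models.

```python
import itertools
import math

LABELS = (0, 1)

def score(y, theta):
    # Unnormalized log-score of a chain labelling under a hypothetical two-parameter model.
    agree = sum(1.0 for a, b in zip(y, y[1:]) if a == b)
    return theta[0] * agree + theta[1] * sum(y)

def conditional(y, m, theta):
    # P(Y_m = y[m] | all other variables): normalize over the labels of Y_m only.
    # For brevity the full score is re-evaluated; only factors touching Y_m matter.
    logits = [score(y[:m] + (v,) + y[m + 1:], theta) for v in LABELS]
    log_z_m = math.log(sum(math.exp(logit) for logit in logits))
    return math.exp(logits[LABELS.index(y[m])] - log_z_m)

def log_pseudo_likelihood(y, theta):
    # Sum of log-conditionals: one local normalization per variable.
    return sum(math.log(conditional(y, m, theta)) for m in range(len(y)))

def exact_conditional(y, m, theta):
    # Reference value from the full joint, for checking on small models only.
    states = list(itertools.product(LABELS, repeat=len(y)))
    z = sum(math.exp(score(s, theta)) for s in states)
    joint = {s: math.exp(score(s, theta)) / z for s in states}
    same_rest = [s for s in states if s[:m] == y[:m] and s[m + 1:] == y[m + 1:]]
    return joint[y] / sum(joint[s] for s in same_rest)

theta = (0.8, -0.3)  # arbitrary illustrative parameters
y = (1, 1, 0, 1)     # made-up labelling
log_pl = log_pseudo_likelihood(y, theta)
```

`conditional` never touches the global partition function: its normalizer sums over the two labels of one variable, so the number of local normalizations grows linearly with the number of variables.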

In document Computer aided diagnosis of breast cancer in mammography using deep neural networks (Page 88-90)