1.4 Unsupervised Learning
1.4.3 Directed vs Undirected Graphical Models
It can be prohibitively expensive to learn a full joint distribution pθ(x), as the number of parameters in a naive parametrization tends to grow exponentially in the dimensionality of x. One can drastically reduce the number of parameters to learn by exploiting conditional independencies present (or assumed to be present) in the data distribution. The field of graphical models (Koller and Friedman, 2009) formalizes these concepts by combining elements of probability and graph theory. A graphical model represents a probability distribution pθ(x), x ∈ R^D, through a graph G = (V, E), where V = {1, ..., D} is the set of vertices and E ⊂ V × V the set of edges. Each vertex Vi represents a random variable xi, while the absence of an edge between variables encodes conditional independencies present in the underlying distribution. The probability density function is defined as a product of factors, each computed over a local subset of the vertices. There are two broad families of graphical models.
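To make the savings concrete, the following minimal sketch counts free parameters when the variables are discrete and every factor is stored as a probability table. It is an illustration under stated assumptions rather than part of the original argument: the function names, the choice of binary variables, D = 20, and the chain-shaped dependency structure (each variable conditioned only on its predecessor) are all hypothetical.

```python
def full_joint_params(num_vars, num_states=2):
    """Free parameters of an unrestricted joint table over discrete variables."""
    return num_states ** num_vars - 1


def factorized_params(parents, num_states=2):
    """Free parameters when each variable has a table conditioned on its parents.

    `parents` maps each variable index to the list of variables it is
    conditioned on (a hypothetical structure chosen for illustration).
    """
    return sum(
        (num_states - 1) * num_states ** len(parent_list)
        for parent_list in parents.values()
    )


D = 20
# Hypothetical chain: variable i is conditioned only on variable i - 1.
chain = {i: ([i - 1] if i > 0 else []) for i in range(D)}

print(full_joint_params(D))      # 2**20 - 1 = 1048575
print(factorized_params(chain))  # 1 + 19 * 2 = 39
```

Under these assumptions the table-based joint needs over a million parameters, while the chain needs 39; the gap widens exponentially as D grows.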
Directed Bayes Networks. In Directed Bayes Networks (BN), or directed graphical models, pθ(x) is defined as a product of normalized conditional distributions, where the conditional dependencies are captured via a Directed Acyclic Graph (DAG). Denoting by Vπ(i) the set of parents of node Vi (and by xπ(i) the associated random variables), the pdf associated with G is obtained as:

\[ p_\theta(x) = \prod_{i=1}^{D} p_{\theta_i}\left(x_i \mid x_{\pi(i)}\right). \tag{1.25} \]

One should recognize the chain rule of probability, p(x) = p(xD | xD−1, ..., x1) p(xD−1 | xD−2, ..., x1) ... p(x2 | x1) p(x1), where some of the conditioning variables in the complete expansion have been dropped due to conditional independencies encoded in the graph structure. Examples of BNs are given in Figure 1.4 for an arbitrary distribution (left), along with the naive Bayes classifier (middle), which models the input variables (x1, x2, ..., xD) as being independent when conditioned on the class label y.
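As a small illustration of this factorization, the sketch below evaluates the joint of a naive Bayes model as the product of p(y) and one conditional p(xi | y) per input. All numbers, the restriction to binary variables, D = 3, and the function names are assumptions made for the example; they do not come from Figure 1.4.

```python
from itertools import product

# Hypothetical parameters for a binary class y and D = 3 binary inputs.
P_Y = {0: 0.6, 1: 0.4}  # p(y)
P_X1_GIVEN_Y = [        # entry i maps y -> p(x_i = 1 | y)
    {0: 0.1, 1: 0.8},
    {0: 0.5, 1: 0.3},
    {0: 0.9, 1: 0.2},
]


def joint(y, x):
    """p(y, x) = p(y) * prod_i p(x_i | y): one factor per node given its parent."""
    prob = P_Y[y]
    for x_i, table in zip(x, P_X1_GIVEN_Y):
        prob *= table[y] if x_i == 1 else 1.0 - table[y]
    return prob


def posterior(x):
    """p(y | x) by Bayes' rule, normalizing the joint over the two classes."""
    scores = {y: joint(y, x) for y in P_Y}
    evidence = sum(scores.values())
    return {y: score / evidence for y, score in scores.items()}


# Because every factor is a normalized conditional, the joint sums to one
# without any explicit normalization constant.
total = sum(joint(y, x) for y in P_Y for x in product((0, 1), repeat=3))
print(total)
print(posterior((1, 0, 0)))
```

The model is specified by 1 + 2 × 3 = 7 numbers, whereas an unrestricted table over the four binary variables would need 15.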
Figure 1.4 (right) shows the graphical model underpinning several popular feature (or dictionary) learning methods, which model the input x as the linear combination of a latent code h ∈ R^N with a feature (or dictionary) matrix W ∈ R^{D×N}. More precisely, real-valued inputs are modeled as p(x | h) = N(x | Wh + µ, Ψ), with mean vector µ ∈ R^D and covariance matrix Ψ ∈ R^{D×D}. Setting the prior p(h) to be Gaussian and constraining the covariance matrix Ψ to be isotropic or diagonal recovers the famous probabilistic PCA (Tipping and Bishop, 1999) and Factor Analysis (Basilevsky, 1994) algorithms, respectively. Sparse Coding (Olshausen and Field, 1996) models the input as the linear combination of a small subset of basis filters (columns of the weight matrix W). This is achieved by using a sparsity-inducing prior on h, such as the Laplace distribution, and setting N ≫ D. Compared to similar feature learning algorithms presented in Chapter 2, Sparse Coding (and directed models in general) benefits from the property of explaining away: while hi and hj are independent random variables a priori, they become dependent when conditioning on the input variable x. This is a powerful reasoning mechanism which allows the latent features to “compete” in explaining the input. Unfortunately, this also tends to make inference more complex, often requiring iterative inference algorithms. Sampling in BNs, on the other hand, is straightforward and can be performed in a single top-down pass, a procedure known as ancestral sampling.ⁱ
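Both explaining away and ancestral sampling can be seen in a toy directed model with two independent binary causes h1, h2 and one observed effect x. The sketch below is illustrative only: the probability values, the binary variables, and the function names are assumptions, and this is a discrete stand-in rather than the linear-Gaussian or Sparse Coding models discussed above.

```python
import random
from itertools import product

# Hypothetical parameters: priors on the causes and p(x = 1 | h1, h2).
P_H1 = 0.1
P_H2 = 0.1
P_X_GIVEN_H = {(0, 0): 0.01, (1, 0): 0.9, (0, 1): 0.9, (1, 1): 0.99}


def bernoulli(p, value):
    """Probability that a Bernoulli(p) variable takes `value` (0 or 1)."""
    return p if value == 1 else 1.0 - p


def joint(h1, h2, x):
    """p(h1, h2, x) = p(h1) p(h2) p(x | h1, h2)."""
    return (bernoulli(P_H1, h1) * bernoulli(P_H2, h2)
            * bernoulli(P_X_GIVEN_H[(h1, h2)], x))


def prob(query, evidence):
    """P(query | evidence) by brute-force enumeration of all 8 configurations."""
    numerator = denominator = 0.0
    for h1, h2, x in product((0, 1), repeat=3):
        assignment = {"h1": h1, "h2": h2, "x": x}
        if any(assignment[k] != v for k, v in evidence.items()):
            continue
        p = joint(h1, h2, x)
        denominator += p
        if all(assignment[k] == v for k, v in query.items()):
            numerator += p
    return numerator / denominator


def ancestral_sample(rng):
    """Sample parents first, then the child given the sampled parents."""
    h1 = int(rng.random() < P_H1)
    h2 = int(rng.random() < P_H2)
    x = int(rng.random() < P_X_GIVEN_H[(h1, h2)])
    return h1, h2, x


# Explaining away: observing x = 1 raises belief in h1, but additionally
# learning that h2 = 1 "explains" x and lowers belief in h1 again.
given_x = prob({"h1": 1}, {"x": 1})                # about 0.505
given_x_h2 = prob({"h1": 1}, {"x": 1, "h2": 1})    # about 0.109
print(given_x, given_x_h2)

rng = random.Random(0)
samples = [ancestral_sample(rng) for _ in range(20000)]
freq_x = sum(x for _, _, x in samples) / len(samples)
print(freq_x)  # Monte Carlo estimate of P(x = 1), whose exact value is 0.18
```

Note that sampling requires a single pass through the variables, while the posterior queries above needed a sum over every configuration of the remaining variables.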
i. To sample from the model of Fig. 1.4 (left), start by sampling from the ancestral nodes, then repeatedly sample from the conditional distributions following a topological ordering of the nodes. The procedure is as follows: x1 ∼ p(x1), x2 ∼ p(x2), x3 ∼ p(x3 | x1 = x1), x4 ∼ p(x4 |

Markov Random Fields. Undirected graphical models, also known as Boltzmann Machines or Markov Random Fields (MRFs) with latent variables, are the central topic of this thesis. Since edges are undirected, the decomposition of pθ(x) cannot rely on a topological ordering of the graph; instead, the pdf decomposes as a product of factors defined over cliques of the graph, i.e. subsets of vertices which are fully connected. Denoting by x_{C_k} the variables belonging to the k-th clique, the pdf can then be written as:

\[ p_\theta(x) = \frac{1}{Z} \prod_{k=1}^{K} \psi\left(x_{C_k}\right). \tag{1.26} \]
Here ψ are the potential functions, which are constrained to be positive, and Z is the partition function or normalization constant. Popular examples of MRFs include Ising Models, Conditional Random Fields (Lafferty et al., 2001) and the Restricted and Deep Boltzmann Machines, which are the subject of Chapter 2.
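The role of the partition function can be illustrated on a tiny Ising-style chain, where each edge is a clique with a positive pairwise potential. This is a minimal sketch under assumptions of our own: the 4-node chain, the coupling value, the ±1 states, and the function names are hypothetical and chosen only to keep the enumeration small.

```python
import math
from itertools import product

STATES = (-1, 1)
EDGES = [(0, 1), (1, 2), (2, 3)]  # hypothetical 4-node chain; each edge is a clique
COUPLING = 0.5                    # hypothetical interaction strength


def psi(a, b):
    """Positive pairwise potential that favors equal neighboring states."""
    return math.exp(COUPLING * a * b)


def unnormalized(x):
    """Product of clique potentials for one configuration x."""
    value = 1.0
    for i, j in EDGES:
        value *= psi(x[i], x[j])
    return value


# Z sums the unnormalized product over every configuration (2**4 here).
configurations = list(product(STATES, repeat=4))
Z = sum(unnormalized(x) for x in configurations)
p = {x: unnormalized(x) / Z for x in configurations}

print(Z)
print(p[(1, 1, 1, 1)], p[(1, -1, 1, -1)])
```

Unlike the directed case, the potentials are not individually normalized, so Z must be computed explicitly; brute-force enumeration grows exponentially with the number of variables, which is why it is only feasible for toy models such as this one.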
The formalism of graphical models represents a powerful framework for reasoning about distributions. Tasks such as computing posterior distributions, MAP estimates or marginals can all be made more efficient by exploiting the structure of the graph. The simplest such algorithm, variable elimination, exploits conditional independencies to determine the ordering in which variables should be summed (or integrated) out. Message passing algorithms exploit dynamic programming to perform these tasks efficiently, by reusing intermediate computations. When G is tree-structured, the sum-product and max-product algorithms can compute marginals or MAP estimates (respectively) of full or conditional distributions in time linear in the number of nodes. Message passing algorithms can also be extended to perform exact inference on general graphs via the junction tree algorithm (whose cost is exponential in the treewidth of the graph), or approximate inference in general “loopy” graphs using loopy belief propagation. For a more thorough treatment of this material, we refer the reader to Koller and Friedman (2009) and Wainwright and Jordan (2008).
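The linear-time claim for tree-structured graphs can be checked on the simplest tree, a chain. The sketch below computes all single-node marginals with forward and backward messages and compares them against brute-force enumeration. It is an illustration under assumptions: the chain length, the unary and pairwise potentials, and all names are hypothetical, and a chain is only a special case of the general sum-product algorithm.

```python
import math
from itertools import product

STATES = (-1, 1)
NUM_NODES = 5     # hypothetical chain length
COUPLING = 0.5    # hypothetical pairwise strength
FIELD = 0.2       # hypothetical unary bias, so marginals are not uniform


def unary(a):
    return math.exp(FIELD * a)


def pairwise(a, b):
    return math.exp(COUPLING * a * b)


def chain_marginals(n):
    """Sum-product on a chain: one forward and one backward sweep of messages."""
    # forward[i][a]: sum over x_0..x_{i-1} of all factors left of node i, with x_i = a.
    forward = [{a: 1.0 for a in STATES} for _ in range(n)]
    for i in range(1, n):
        for a in STATES:
            forward[i][a] = sum(
                forward[i - 1][b] * unary(b) * pairwise(b, a) for b in STATES
            )
    # backward[i][a]: sum over x_{i+1}..x_{n-1} of all factors right of node i.
    backward = [{a: 1.0 for a in STATES} for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for a in STATES:
            backward[i][a] = sum(
                backward[i + 1][b] * unary(b) * pairwise(a, b) for b in STATES
            )
    marginals = []
    for i in range(n):
        belief = {a: forward[i][a] * unary(a) * backward[i][a] for a in STATES}
        z = sum(belief.values())
        marginals.append({a: belief[a] / z for a in STATES})
    return marginals


def brute_force_marginals(n):
    """Reference answer: enumerate all 2**n configurations."""
    weights = {}
    for x in product(STATES, repeat=n):
        w = 1.0
        for i in range(n):
            w *= unary(x[i])
        for i in range(n - 1):
            w *= pairwise(x[i], x[i + 1])
        weights[x] = w
    z = sum(weights.values())
    return [
        {a: sum(w for x, w in weights.items() if x[i] == a) / z for a in STATES}
        for i in range(n)
    ]


fast = chain_marginals(NUM_NODES)
slow = brute_force_marginals(NUM_NODES)
max_gap = max(abs(fast[i][a] - slow[i][a]) for i in range(NUM_NODES) for a in STATES)
print(max_gap)
```

The message-passing version reuses each partial sum once per node, so its cost grows linearly with the chain length, whereas the enumeration doubles with every added node.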